cdp_data package¶
Subpackages¶
Submodules¶
cdp_data.constants module¶
cdp_data.datasets module¶
- cdp_data.datasets.convert_transcript_to_dataframe(transcript: str | Path | Transcript) DataFrame [source]¶
Create a DataFrame containing only the sentence data from the provided transcript.
- Parameters:
- transcript: Union[str, Path, Transcript]
The transcript to pull all sentences from.
- Returns:
- pd.DataFrame:
The sentences of the transcript.
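A minimal usage sketch, assuming a transcript JSON file was previously stored by get_session_dataset (the path below is hypothetical):
>>> from cdp_data.datasets import convert_transcript_to_dataframe
>>> sentences = convert_transcript_to_dataframe(
...     "cdp-datasets/cdp-seattle-21723dcf/event-{event-id}/session-{session-id}/transcript.json"
... )
>>> sentences.head()  # one row per sentence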
- cdp_data.datasets.get_session_dataset(infrastructure_slug: str, start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, sample: int | float | None = None, replace_py_objects: bool = False, store_full_metadata: bool = False, store_transcript: bool = False, transcript_selection: str = 'created', store_transcript_as_csv: bool = False, store_video: bool = False, store_audio: bool = False, cache_dir: str | Path | None = None, raise_on_error: bool = True, tqdm_kws: Dict[str, Any] | None = None) DataFrame [source]¶
Get a dataset of sessions from a CDP infrastructure.
- Parameters:
- infrastructure_slug: str
The CDP infrastructure to connect to and pull sessions for.
- start_datetime: Optional[Union[str, datetime]]
An optional datetime that the session dataset will start at. Default: None (no datetime beginning bound on the dataset)
- end_datetime: Optional[Union[str, datetime]]
An optional datetime that the session dataset will end at. Default: None (no datetime end bound on the dataset)
- sample: Optional[Union[int, float]]
An optional sample of the dataset to return. If an int, the number of rows to return. If a float, the percentage of rows to return. Default: None (return all rows)
- replace_py_objects: bool
Replace any non-standard Python types with standard ones so the returned data is ready for storage. See ‘See Also’ for more details. Default: False (keep Python objects in the DataFrame)
- store_full_metadata: bool
Should a JSON file of the full event metadata be stored to disk, with a path to the stored JSON file added to the returned DataFrame. Currently not implemented. Default: False (do not request extra data and store to disk)
- store_transcript: bool
Should a session transcript be requested and stored to disk and a path to the stored transcript JSON file be added to the returned DataFrame. Default: False (do not request extra data and do not store the transcript)
- transcript_selection: str
How the single transcript should be selected. Default: “created” (return the most recently created transcript per session)
- store_transcript_as_csv: bool
Additionally convert and store all transcripts as CSVs. Does nothing if store_transcript is False. Default: False (do not convert and store again)
- store_video: bool
Should the session video be requested and stored to disk and a path to the stored video file be added to the returned DataFrame. Note: the video is stored without a file extension; however, the video will always be either mp4 or webm. Default: False (do not request and store the video)
- store_audio: bool
Should the session audio be requested and stored to disk and a path to the stored audio file be added to the returned DataFrame. Default: False (do not request and store the audio)
- cache_dir: Optional[Union[str, Path]]
An optional directory path to cache the dataset. Directory is created if it does not exist. Default: “./cdp-datasets”
- raise_on_error: bool
Should any failure to pull files result in an error or be ignored. Default: True (raise on any failure)
- tqdm_kws: Dict[str, Any]
A dictionary with extra keyword arguments to provide to tqdm progress bars. Must not include the desc keyword argument.
- Returns:
- dataset: pd.DataFrame
The dataset with all additions requested.
See also
replace_dataframe_cols_with_storage_replacements
The function used to clean the data of non-standard Python types.
Notes
All file additions (transcript, full event metadata, video, audio, etc.) are cached to disk to avoid multiple downloads. If you use the same cache directory across multiple runs, no new data will be downloaded; the existing files will be used instead. Caching is done simply by file existence, not by content hash comparison.
Datasets are cached with the following structure:
{cache-dir}/
└── {infrastructure_slug}
    ├── event-{event-id-0}
    │   ├── metadata.json
    │   └── session-{session-id-0}
    │       ├── audio.wav
    │       ├── transcript.json
    │       └── video
    ├── event-{event-id-1}
    │   ├── metadata.json
    │   └── session-{session-id-0}
    │       ├── audio.wav
    │       ├── transcript.json
    │       └── video
    └── event-{event-id-2}
        ├── metadata.json
        ├── session-{session-id-0}
        │   ├── audio.wav
        │   ├── transcript.json
        │   └── video
        └── session-{session-id-1}
            ├── audio.wav
            ├── transcript.json
            └── video
To clean a whole dataset, or specific events or sessions, simply delete the associated directory.
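As an illustrative sketch (the date range is arbitrary), pulling one month of Seattle sessions with transcripts stored to the local cache:
>>> from cdp_data import datasets, CDPInstances
>>> ds = datasets.get_session_dataset(
...     infrastructure_slug=CDPInstances.Seattle,
...     start_datetime="2021-01-01",
...     end_datetime="2021-02-01",
...     store_transcript=True,
...     cache_dir="./cdp-datasets",
... )
>>> ds["transcript_path"].head()  # paths to the cached transcript JSON files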
- cdp_data.datasets.get_vote_dataset(infrastructure_slug: str, start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, replace_py_objects: bool = False, tqdm_kws: Dict[str, Any] | None = None) DataFrame [source]¶
Get a dataset of votes from a CDP infrastructure.
- Parameters:
- infrastructure_slug: str
The CDP infrastructure to connect to and pull votes for.
- start_datetime: Optional[Union[str, datetime]]
An optional datetime that the vote dataset will start at. Default: None (no datetime beginning bound on the dataset)
- end_datetime: Optional[Union[str, datetime]]
An optional datetime that the vote dataset will end at. Default: None (no datetime end bound on the dataset)
- replace_py_objects: bool
Replace any non-standard Python types with standard ones so the returned data is ready for storage. See ‘See Also’ for more details. Default: False (keep Python objects in the DataFrame)
- tqdm_kws: Dict[str, Any]
A dictionary with extra keyword arguments to provide to tqdm progress bars. Must not include the desc keyword argument.
- Returns:
- pd.DataFrame
The dataset requested.
See also
replace_dataframe_cols_with_storage_replacements
The function used to clean the data of non-standard Python types.
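A short sketch (the infrastructure and date bound are arbitrary):
>>> from cdp_data import datasets, CDPInstances
>>> votes = datasets.get_vote_dataset(
...     infrastructure_slug=CDPInstances.Seattle,
...     start_datetime="2021-01-01",
...     replace_py_objects=True,  # make the returned columns storage ready
... )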
- cdp_data.datasets.replace_dataframe_cols_with_storage_replacements(df: DataFrame) DataFrame [source]¶
Run various replacement functions against the DataFrame to get the data to a point where it can be stored to disk.
- Parameters:
- df: pd.DataFrame
The data to fix.
- Returns:
- pd.DataFrame
The updated DataFrame.
See also
replace_db_model_cols_with_id_cols
Function to replace database model column values with their ids.
replace_pathlib_path_cols_with_str_path_cols
Function to replace pathlib Path column values with normal Python strings.
- cdp_data.datasets.replace_db_model_cols_with_id_cols(df: DataFrame) DataFrame [source]¶
Replace all database model column values with the model ID.
Example: an event column containing event models will be replaced by an event_id column containing just the event IDs.
- Parameters:
- df: pd.DataFrame
The data to replace database models with just ids.
- Returns:
- pd.DataFrame
The updated DataFrame.
- cdp_data.datasets.replace_pathlib_path_cols_with_str_path_cols(df: DataFrame) DataFrame [source]¶
Replace all pathlib Path column values with string column values.
Example: a transcript_path column containing pathlib Path values will be replaced with plain Python string values.
- Parameters:
- df: pd.DataFrame
The data to replace paths in.
- Returns:
- pd.DataFrame
The updated DataFrame.
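A self-contained sketch of the umbrella replacement function; the toy DataFrame below stands in for a real dataset:
>>> from pathlib import Path
>>> import pandas as pd
>>> from cdp_data.datasets import replace_dataframe_cols_with_storage_replacements
>>> df = pd.DataFrame({"transcript_path": [Path("transcript.json")]})
>>> out = replace_dataframe_cols_with_storage_replacements(df)
>>> out["transcript_path"]  # now plain strings rather than Path objects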
- cdp_data.datasets.save_dataset(df: DataFrame, dest: str | Path) Path [source]¶
Helper function to store a dataset to disk, replacing non-standard Python types with storage-ready replacements.
- Parameters:
- df: pd.DataFrame
The DataFrame to store.
- dest: Union[str, Path]
The path to store the data. Must end in “.csv” or “.parquet”.
- Returns:
- Path:
The path to the stored data.
See also
replace_dataframe_cols_with_storage_replacements
The function used to replace column values.
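For example, a sketch of storing a dataset (the column names are illustrative; the file extension selects the format):
>>> import pandas as pd
>>> from cdp_data.datasets import save_dataset
>>> df = pd.DataFrame({"event_id": ["abc"], "session_index": [0]})
>>> out_path = save_dataset(df, "example-dataset.csv")  # ".parquet" also works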
cdp_data.instances module¶
- class cdp_data.instances.CDPInstances[source]¶
Bases:
object
Container for CDP instance infrastructure slugs.
Examples
>>> from cdp_data import datasets, CDPInstances
>>> ds = datasets.get_session_dataset(
...     infrastructure_slug=CDPInstances.Seattle
... )
- Alameda = 'cdp-alameda-d3dabe54'¶
- Albuquerque = 'cdp-albuquerque-1d29496e'¶
- Atlanta = 'cdp-atlanta-37e7dd70'¶
- Boston = 'cdp-boston-c384047b'¶
- Charlotte = 'cdp-charlotte-98a7c348'¶
- Denver = 'cdp-denver-962aefef'¶
- KingCounty = 'cdp-king-county-b656c71b'¶
- LongBeach = 'cdp-long-beach-49323fe9'¶
- Louisville = 'cdp-louisville-6fd32a38'¶
- Milwaukee = 'cdp-milwaukee-9f60e352'¶
- Missoula = 'missoula-council-data-proj'¶
- MountainView = 'cdp-mountain-view-7c8a47df'¶
- Oakland = 'cdp-oakland-ba81c097'¶
- Portland = 'cdp-portland-d2bbda97'¶
- Richmond = 'cdp-richmond-a3d06941'¶
- SanJose = 'cdp-san-jose-5d9db455'¶
- Seattle = 'cdp-seattle-21723dcf'¶
cdp_data.keywords module¶
- cdp_data.keywords.compute_ngram_usage_history(infrastructure_slug: str | List[str], ngram_size: int = 1, strict: bool = False, start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, cache_dir: str | Path | None = None, raise_on_error: bool = True, tqdm_kws: Dict[str, Any] | None = None) DataFrame [source]¶
Pull the minimal data needed for a session dataset for the provided infrastructure(s) and start and end datetimes, then compute the ngram usage history DataFrame.
- Parameters:
- infrastructure_slug: Union[str, List[str]]
The CDP infrastructure(s) to connect to and pull sessions for.
- ngram_size: int
The ngram size to use for counting and calculating usage. Default: 1 (unigrams)
- strict: bool
Should all ngrams be left unstemmed for a stricter usage history. Default: False (stem and clean all grams in the dataset)
- start_datetime: Optional[Union[str, datetime]]
An optional datetime that the session dataset will start at. Default: None (no datetime beginning bound on the dataset)
- end_datetime: Optional[Union[str, datetime]]
An optional datetime that the session dataset will end at. Default: None (no datetime end bound on the dataset)
- cache_dir: Optional[Union[str, Path]]
An optional directory path to cache the dataset. Directory is created if it does not exist. Default: “./cdp-datasets”
- raise_on_error: bool
Should any failure to pull files result in an error or be ignored. Default: True (raise on any failure)
- tqdm_kws: Dict[str, Any]
A dictionary with extra keyword arguments to provide to tqdm progress bars. Must not include the desc keyword argument.
- Returns:
- gram_usage: pd.DataFrame
A pandas DataFrame of all found ngrams (stemmed and cleaned, or unstemmed and uncleaned), their counts for each session, and their percentage of use, computed as each ngram’s usage for the day over the sum of all ngrams used that day.
See also
cdp_data.datasets.get_session_dataset
Function to pull or load a cached session dataset.
cdp_data.plotting.plot_ngram_usage_histories
Plot ngram usage history data.
Notes
This function calculates the counts and percentage of each ngram used for a day over the sum of all other ngrams used in that day’s discussion(s). This is close but not exactly the same as Google’s NGram Viewer: https://books.google.com/ngrams
This function will pull a new session dataset and cache transcripts to the local disk in the provided (or default) cache directory.
It is recommended to cache this dataset after computation because it may take a while depending on machine resources and availability.
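An illustrative sketch (the date range is arbitrary), including caching the computed result as the note above recommends:
>>> from cdp_data import keywords, CDPInstances
>>> gram_usage = keywords.compute_ngram_usage_history(
...     CDPInstances.Seattle,
...     ngram_size=1,
...     start_datetime="2021-01-01",
...     end_datetime="2021-02-01",
... )
>>> gram_usage.to_csv("seattle-gram-usage.csv", index=False)  # cache for reuse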
- cdp_data.keywords.compute_query_semantic_similarity_history(query: str | List[str], infrastructure_slug: str | List[str], start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, cache_dir: str | Path | None = None, embedding_model: str = 'msmarco-distilbert-base-v4', raise_on_error: bool = True, tqdm_kws: Dict[str, Any] | None = None) DataFrame [source]¶
Compute the semantic similarity of a query against every sentence of every meeting. The max, min, and mean semantic similarity of each meeting will be returned.
- Parameters:
- query: Union[str, List[str]]
The query(ies) to compare each sentence against.
- infrastructure_slug: Union[str, List[str]]
The CDP infrastructure(s) to connect to and pull sessions for.
- start_datetime: Optional[Union[str, datetime]]
The earliest possible datetime for sessions to be retrieved for. If provided as a string, the datetime should be in ISO format.
- end_datetime: Optional[Union[str, datetime]]
The latest possible datetime for sessions to be retrieved for. If provided as a string, the datetime should be in ISO format.
- cache_dir: Optional[Union[str, Path]]
An optional directory path to cache the dataset. Directory is created if it does not exist. Default: “./cdp-datasets”
- embedding_model: str
The sentence transformers model to use for embedding the query and each sentence. Default: “msmarco-distilbert-base-v4” All embedding models are available here: https://www.sbert.net/docs/pretrained-models/msmarco-v3.html Select any of the “Models tuned for cosine-similarity”.
- raise_on_error: bool
Should any failure to pull files result in an error or be ignored. Default: True (raise on any failure)
- tqdm_kws: Dict[str, Any]
A dictionary with extra keyword arguments to provide to tqdm progress bars. Must not include the desc keyword argument.
- Returns:
- pd.DataFrame
The min, max, and mean semantic similarity for each event as compared to the query for the events within the datetime range.
Notes
This function requires additional dependencies. Install extra requirements with: pip install cdp-data[transformers].
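A usage sketch, assuming the transformers extra is installed (the query and date range are arbitrary):
>>> from cdp_data import keywords, CDPInstances
>>> similarity = keywords.compute_query_semantic_similarity_history(
...     query="housing affordability",
...     infrastructure_slug=CDPInstances.Seattle,
...     start_datetime="2021-01-01",
...     end_datetime="2021-02-01",
... )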
- cdp_data.keywords.fill_history_data_with_zeros(data: DataFrame, ngram_col: str, dt_col: str) DataFrame [source]¶
A utility function to fill ngram history data with zeros for all missing dates.
- Parameters:
- data: pd.DataFrame
The ngram history data to fill dates for.
- ngram_col: str
The column name in which the ngram is stored.
- dt_col: str
The column name in which the datetime is stored.
- Returns:
- data: pd.DataFrame
A DataFrame containing the original data plus rows for any missing dates, with their values set to zero.
See also
cdp_data.plotting.prepare_ngram_history_plotting_data
Subsets to plotting only columns and ensures values are sorted and grouped.
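A sketch of filling gaps, assuming gram_usage is a DataFrame returned by compute_ngram_usage_history; the column names below are illustrative and should match whichever columns hold the ngram and datetime in your data:
>>> from cdp_data.keywords import fill_history_data_with_zeros
>>> filled = fill_history_data_with_zeros(
...     data=gram_usage,
...     ngram_col="ngram",  # illustrative column name
...     dt_col="session_date",  # illustrative column name
... )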
- cdp_data.keywords.get_ngram_relevancy_history(ngram: str, strict: bool = False, start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, infrastructure_slug: str | None = None) DataFrame [source]¶
Pull an n-gram’s relevancy history from a CDP database.
- Parameters:
- ngram: str
The unigram, bigram, or trigram to retrieve history for.
- strict: bool
Should the provided ngram be used for a strict “unstemmed_gram” query or not. Default: False (stem and clean the ngram before querying)
- start_datetime: Optional[Union[str, datetime]]
The earliest possible datetime for ngram history to be retrieved for. If provided as a string, the datetime should be in ISO format.
- end_datetime: Optional[Union[str, datetime]]
The latest possible datetime for ngram history to be retrieved for. If provided as a string, the datetime should be in ISO format.
- infrastructure_slug: Optional[str]
The optional CDP infrastructure slug to connect to. Default: None (you are managing the database connection yourself)
- Returns:
- ngram_history: pd.DataFrame
A pandas DataFrame of all IndexedEventGrams that match the provided ngram query (stemmed or unstemmed).
See also
cdp_data.keywords.compute_ngram_usage_history
Compute usage history for all ngrams in a specific CDP session dataset. Useful for comparing how much discussion is comprised of specific ngrams.
Notes
This function pulls the TF-IDF (or other future indexed values) score for the provided ngram over time. This is a measure of relevancy to a document and not the same as Google’s NGram Viewer which shows what percentage of literature used the term.
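For example, a minimal sketch of pulling one ngram’s relevancy history (the ngram is arbitrary):
>>> from cdp_data import keywords, CDPInstances
>>> ngram_history = keywords.get_ngram_relevancy_history(
...     "housing",
...     infrastructure_slug=CDPInstances.Seattle,
... )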
cdp_data.plots module¶
Module contents¶
Top-level package for cdp_data.
- class cdp_data.CDPInstances[source]¶
Bases:
object
Container for CDP instance infrastructure slugs.
Examples
>>> from cdp_data import datasets, CDPInstances
>>> ds = datasets.get_session_dataset(
...     infrastructure_slug=CDPInstances.Seattle
... )
- Alameda = 'cdp-alameda-d3dabe54'¶
- Albuquerque = 'cdp-albuquerque-1d29496e'¶
- Atlanta = 'cdp-atlanta-37e7dd70'¶
- Boston = 'cdp-boston-c384047b'¶
- Charlotte = 'cdp-charlotte-98a7c348'¶
- Denver = 'cdp-denver-962aefef'¶
- KingCounty = 'cdp-king-county-b656c71b'¶
- LongBeach = 'cdp-long-beach-49323fe9'¶
- Louisville = 'cdp-louisville-6fd32a38'¶
- Milwaukee = 'cdp-milwaukee-9f60e352'¶
- Missoula = 'missoula-council-data-proj'¶
- MountainView = 'cdp-mountain-view-7c8a47df'¶
- Oakland = 'cdp-oakland-ba81c097'¶
- Portland = 'cdp-portland-d2bbda97'¶
- Richmond = 'cdp-richmond-a3d06941'¶
- SanJose = 'cdp-san-jose-5d9db455'¶
- Seattle = 'cdp-seattle-21723dcf'¶