cdp_scrapers package¶
- cdp_scrapers.instances package
- Submodules
- cdp_scrapers.instances.atlanta module
- cdp_scrapers.instances.empty module
- cdp_scrapers.instances.houston module
- cdp_scrapers.instances.kingcounty module
- cdp_scrapers.instances.lacity module
- cdp_scrapers.instances.portland module
- cdp_scrapers.instances.seattle module
- Module contents
cdp_scrapers.legistar_content_parsers module¶
cdp_scrapers.legistar_utils module¶
- class cdp_scrapers.legistar_utils.ContentUriScrapeResult(status, uris)[source]¶
Create new instance of ContentUriScrapeResult(status, uris)
- class Status(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Status of content parsing.
- ContentNotProvidedError = -3¶
- Ok = 0¶
- ResourceAccessError = -2¶
- UnrecognizedPatternError = -1¶
- uris: list[ContentURIs] | None¶
Alias for field number 1
- class cdp_scrapers.legistar_utils.LegistarScraper(client: str, timezone: str, ignore_minutes_item_patterns: list[str] | None = None, vote_approve_pattern: str = 'approve|favor|yes', vote_abstain_pattern: str = 'abstain|refuse|refrain', vote_reject_pattern: str = 'reject|oppose|no', vote_absent_pattern: str = 'absent', vote_nonvoting_pattern: str = 'nv|(?:non.*voting)', matter_adopted_pattern: str = 'approved|confirmed|passed|adopted|consent|(?:voted.*com+it+ee)', matter_in_progress_pattern: str = 'heard|read|filed|held|(?:in.*com+it+ee)', matter_rejected_pattern: str = 'rejected|dropped', minutes_item_decision_passed_pattern: str = 'pass', minutes_item_decision_failed_pattern: str = 'not|fail', static_data: ScraperStaticData | None = None, person_aliases: dict[str, set[str]] | None = None, role_replacements: dict[str, str] | None = None)[source]¶
Base class for transforming Legistar API data to CDP IngestionModel.
If get_events() naively fails and raises an error, a given installation must define a derived class and implement the get_content_uris() function.
- Parameters:
- client: str
Legistar client name, e.g. “seattle” for Seattle, “kingcounty” for King County.
- timezone: str
The timezone for the target client, e.g. “America/Los_Angeles” or “America/New_York”. Use a canonical tz database name.
- ignore_minutes_item_patterns: List[str]
A list of string patterns or substrings to act as a minutes item filter. Each item in the provided list is compiled as a regex, and any minutes item that contains the compiled pattern is filtered out of the produced CDP minutes item list. Default: [] (do not filter any minutes items)
- vote_approve_pattern: str
Regex pattern used to convert Legistar instance’s votes in approval value to CDP constant value. Default: “approve|favor|yes”
- vote_abstain_pattern: str
Regex pattern used to convert Legistar instance’s abstention value to CDP constant value. Note, this is a pure abstention, not an “approval by abstention” or “rejection by abstention” value. Those should be placed in vote_approve_pattern and vote_reject_pattern respectively. Default: “abstain|refuse|refrain”
- vote_reject_pattern: str
Regex pattern used to convert Legistar instance’s votes in rejection value to CDP constant value. Default: “reject|oppose|no”
- vote_absent_pattern: str
Regex pattern used to convert Legistar instance’s excused absence value to CDP constant value. Default: “absent”
- vote_nonvoting_pattern: str
Regex pattern used to convert Legistar instance’s non-voting value to CDP constant value. Default: “nv|(?:non.*voting)”
- matter_adopted_pattern: str
Regex pattern used to convert Legistar instance’s matter was adopted to CDP constant value. Default: “approved|confirmed|passed|adopted|consent|(?:voted.*com+it+ee)”
- matter_in_progress_pattern: str
Regex pattern used to convert Legistar instance’s matter is in-progress to CDP constant value. Default: “heard|read|filed|held|(?:in.*com+it+ee)”
- matter_rejected_pattern: str
Regex pattern used to convert Legistar instance’s matter was rejected to CDP constant value. Default: “rejected|dropped”
- minutes_item_decision_passed_pattern: str
Regex pattern used to convert Legistar instance’s minutes item passage to CDP constant value. Default: “pass”
- minutes_item_decision_failed_pattern: str
Regex pattern used to convert Legistar instance’s minutes item failure to CDP constant value. Default: “not|fail”
- static_data: Optional[ScraperStaticData]
predefined Seats, Bodies and Persons used to provide more accurate
- person_aliases: Optional[Dict[str, Set[str]]]
Dictionary used to catch name aliases and resolve improperly unique Persons to the one correct Person. Default: None
- role_replacements: Optional[Dict[str, str]]
Dictionary used to replace role titles with CDP standard role titles. The keys should be titles you want to replace and the values should be a CDP standard role. Default: None
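As a rough illustration of how the vote patterns above might be applied, the sketch below classifies a raw Legistar vote value using the default patterns from the signature. The function name `classify_vote`, the returned labels, the case-insensitive `re.search` matching, and the order of checks are all assumptions made for this example, not the library's implementation.

```python
import re

# Default patterns copied from the LegistarScraper signature.
# The labels on the left are placeholders, not CDP constants.
VOTE_PATTERNS = [
    ("nonvoting", r"nv|(?:non.*voting)"),
    ("approve", r"approve|favor|yes"),
    ("abstain", r"abstain|refuse|refrain"),
    ("reject", r"reject|oppose|no"),
    ("absent", r"absent"),
]


def classify_vote(value):
    """Return the first label whose pattern is found in value, else None."""
    for label, pattern in VOTE_PATTERNS:
        if re.search(pattern, value, re.IGNORECASE):
            return label
    return None
```

The non-voting pattern is checked first in this sketch because the reject pattern contains `no`, which would otherwise also match a value such as “Non-voting”.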
- check_for_cdp_min_ingestion(check_days: int = 7) bool [source]¶
Test whether at least one minimally defined EventIngestionModel can be obtained.
- Parameters:
- check_days: int, default=7
Test duration is the past check_days days from now
- Returns:
- minimum_ingestion_data_available: bool
True if got at least one minimally defined EventIngestionModel
- static date_and_time_to_datetime(ev_date: str, ev_time: str | None) datetime [source]¶
Return datetime from ev_date and ev_time.
- Parameters:
- ev_date: str
Formatted as “%Y-%m-%dT%H:%M:%S”
- ev_time: Optional[str]
Formatted as “%I:%M %p”, or None to leave the time unattached to the date.
- Returns:
- datetime
date using ev_date and time using ev_time
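A minimal sketch of combining the two strings with the formats listed above. The helper name `combine_date_and_time` is hypothetical, and the real method may differ in details such as how it treats a malformed ev_time.

```python
from datetime import datetime


def combine_date_and_time(ev_date, ev_time):
    """Parse ev_date, then overwrite hour and minute from ev_time if given."""
    date = datetime.strptime(ev_date, "%Y-%m-%dT%H:%M:%S")
    if ev_time is None:
        # No time to attach; keep whatever time ev_date carried.
        return date
    time = datetime.strptime(ev_time, "%I:%M %p")
    return date.replace(hour=time.hour, minute=time.minute)
```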
- filter_event_minutes(ev_minutes_item: EventMinutesItem) EventMinutesItem | None [source]¶
Return None if contains unimportant text that we want to ignore.
- Parameters:
- ev_minutes_item: EventMinutesItem
The minutes item to filter.
- Returns:
- filtered_event_minutes_items: Optional[EventMinutesItem]
The allowed minutes item, or None if filtered out.
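The filtering idea can be sketched on plain strings. This assumes the patterns from ignore_minutes_item_patterns are searched against the minutes item name, case-insensitively; the helper name `keep_minutes_item` and the use of a bare name instead of an EventMinutesItem are simplifications for illustration.

```python
import re


def keep_minutes_item(name, ignore_patterns):
    """Return name, or None if any ignore pattern is found in it."""
    for pattern in ignore_patterns:
        if re.search(pattern, name, re.IGNORECASE):
            return None
    return name
```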
- fix_event_minutes(ev_minutes_item: EventMinutesItem | None, legistar_ev_item: dict) EventMinutesItem | None [source]¶
Inspect the MinutesItem and Matter in ev_minutes_item.
- Move some fields between them to make the information more meaningful.
- Enforce matter.result_status when appropriate.
- Parameters:
- ev_minutes_item: Optional[EventMinutesItem]
The specific event minutes item to clean. Or None if running this function in a loop with multiple event minutes items and you don’t want to clean / the emi was filtered out.
- legistar_ev_item: Dict
The original Legistar EventItem.
- Returns:
- cleaned_emi: Optional[EventMinutesItem]
The cleaned event minutes item. This can clean both the event minutes item and the attached matter information.
- get_body(legistar_body: dict[str, Any]) Body | None [source]¶
Return CDP Body for Legistar body.
- Parameters:
- legistar_body: Dict
Legistar API body
- Returns:
- body: Optional[Body]
The Legistar body converted to a CDP body ingestion model. None if missing required information.
- get_content_uris(legistar_ev: dict) list[ContentURIs] [source]¶
Must implement in class derived from LegistarScraper. If Legistar Event.EventVideoPath is used, return an empty list in the override.
- Parameters:
- legistar_ev: Dict
Data for one Legistar Event.
- Returns:
- event_content_uris: List[ContentURIs]
List of ContentURIs objects for each session found.
- Raises:
- NotImplementedError
This base implementation does nothing
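The override pattern described above might look like the following. `BaseScraper` is a stand-in for LegistarScraper so that the sketch runs without cdp_scrapers installed, the `ContentURIs` tuple mirrors the fields documented for cdp_scrapers.types.ContentURIs, and the `"VideoUrl"` key is invented for this example.

```python
from collections import namedtuple

# Stand-in with the same two fields as cdp_scrapers.types.ContentURIs.
ContentURIs = namedtuple("ContentURIs", ["video_uri", "caption_uri"])


class BaseScraper:
    """Stand-in for LegistarScraper; only the relevant method is modeled."""

    def get_content_uris(self, legistar_ev):
        raise NotImplementedError


class VideoPathScraper(BaseScraper):
    """Hypothetical installation whose events carry EventVideoPath."""

    def get_content_uris(self, legistar_ev):
        # Return an empty list when Legistar Event.EventVideoPath is used.
        return []


class CustomPageScraper(BaseScraper):
    """Hypothetical installation that locates the video some other way."""

    def get_content_uris(self, legistar_ev):
        # "VideoUrl" is a made-up key used only for this illustration.
        url = legistar_ev.get("VideoUrl")
        if not url:
            return []
        return [ContentURIs(video_uri=url, caption_uri=None)]
```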
- get_event_minutes(legistar_ev_items: list[dict]) list[EventMinutesItem] | None [source]¶
Return List[EventMinutesItem] for Legistar API EventItems.
- Parameters:
- legistar_ev_items: List[Dict]
Legistar API EventItems
- Returns:
- event_minutes_items: Optional[List[EventMinutesItem]]
Filtered set of event minutes items.
- get_event_supporting_files(legistar_ev_attachments: list[dict]) list[SupportingFile] | None [source]¶
Return List[SupportingFile] for Legistar API MatterAttachments.
- Parameters:
- legistar_ev_attachments: List[Dict]
Legistar API MatterAttachments
- Returns:
- files: Optional[List[SupportingFile]]
List of supporting files if provided. None if empty list or missing information.
- get_events(begin: datetime | None = None, end: datetime | None = None) list[EventIngestionModel] [source]¶
Calls get_legistar_events_for_timespan to retrieve Legistar API data and return as List[EventIngestionModel].
- Parameters:
- begin: datetime, optional
The timespan beginning datetime to query for events after. Default is 2 days before UTC now
- end: datetime, optional
The timespan end datetime to query for events before. Default is UTC now
- Returns:
- events: List[EventIngestionModel]
One instance of EventIngestionModel per Legistar Event
- get_matter(legistar_ev: dict) Matter | None [source]¶
Return Matter from Legistar API EventItem.
- Parameters:
- legistar_ev: Dict
Legistar API EventItem
- Returns:
- matter: Optional[Matter]
The Legistar matter details converted to a CDP Matter object. None if missing information.
- get_matter_status(legistar_matter_status: str) str | None [source]¶
Return appropriate MatterStatusDecision constant from EventItemMatterStatus.
- Parameters:
- legistar_matter_status: str
Legistar API EventItemMatterStatus.
- Returns:
- matter_status: Optional[str]
A constant from CDP allowed matter status decisions. None if missing information or if matter status decision parameter patterns are not inclusive of the Legistar matter status value.
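To show how the three matter patterns could resolve a Legistar status string, here is a sketch using the defaults from the LegistarScraper signature. The labels are placeholders rather than the actual MatterStatusDecision constants, and the case-insensitive search and the order of checks are assumptions.

```python
import re

# Defaults copied from the LegistarScraper signature; labels are placeholders.
STATUS_PATTERNS = [
    ("adopted", r"approved|confirmed|passed|adopted|consent|(?:voted.*com+it+ee)"),
    ("in_progress", r"heard|read|filed|held|(?:in.*com+it+ee)"),
    ("rejected", r"rejected|dropped"),
]


def match_matter_status(status):
    """Return the first label whose pattern is found in status, else None."""
    for label, pattern in STATUS_PATTERNS:
        if re.search(pattern, status, re.IGNORECASE):
            return label
    return None
```

A status that none of the patterns covers yields None, which corresponds to the “not inclusive” case described in the Returns block.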
- get_minutes_item(legistar_ev_item: dict) MinutesItem | None [source]¶
Return MinutesItem from parts of Legistar API EventItem.
- Parameters:
- legistar_ev_item: Dict
Legistar API EventItem
- Returns:
- minutes_item: Optional[MinutesItem]
None if a nonempty MinutesItem could not be obtained from the EventItem.
- get_minutes_item_decision(legistar_item_passed_name: str) str | None [source]¶
Return appropriate EventMinutesItemDecision constant from EventItemPassedFlagName.
- Parameters:
- legistar_item_passed_name: str
Legistar API EventItemPassedFlagName
- Returns:
- emi_decision: Optional[str]
A constant from CDP allowed minutes item decisions. None if missing information or if minutes item decision parameter patterns are not inclusive of the Legistar minutes item decision value.
- get_person(legistar_person: dict) Person | None [source]¶
Return CDP Person for Legistar Person.
- Parameters:
- legistar_person: Dict
Legistar API Person
- Returns:
- person: Optional[Person]
The Legistar Person converted to a CDP person ingestion model. None if missing information.
- get_roles(legistar_office_records: list[dict[str, Any]]) list[Role] | None [source]¶
Return list of CDP Role from list of legistar OfficeRecord.
- Parameters:
- legistar_office_records: List[Dict]
Legistar API OfficeRecords
- Returns:
- roles: Optional[List[Role]]
From Legistar OfficeRecords. None if missing information.
- get_vote_decision(legistar_vote: dict) str | None [source]¶
Return appropriate VoteDecision constant based on Legistar Vote.
- Parameters:
- legistar_vote: Dict
Legistar API Vote
- Returns:
- vote_decision: Optional[str]
A constant from CDP allowed vote decisions. None if missing vote information or if vote decision parameter patterns are not inclusive of the Legistar vote value.
- get_votes(legistar_votes: list[dict]) list[Vote] | None [source]¶
Return List[Vote] for Legistar API Votes.
- Parameters:
- legistar_votes: List[Dict]
Legistar API Votes
- Returns:
- votes: Optional[List[Vote]]
List of votes if any were provided. None if empty list or missing information.
- inject_known_data(events: list[EventIngestionModel]) list[EventIngestionModel] [source]¶
Augment with long-term static data that changes very infrequently, e.g. self.static_data, which includes Person.picture_uri.
- Parameters:
- events:
Returned events from get_events()
- Returns:
- events: List[EventIngestionModel]
Input events with static information possibly injected
- inject_known_person(person: Person) Person [source]¶
Inject information if person exists in static_data.persons.
- Parameters:
- person: Person
Person into which to inject data from static_data
- Returns:
- Person
Input person updated with information from static_data, and seat.roles sanitized.
- property is_legistar_compatible: bool¶
Check that Legistar API recognizes client name.
- Returns:
- compatible: bool
True if client_name is a valid Legistar client name
- post_process_ingestion_models(events: list[EventIngestionModel]) list[EventIngestionModel] [source]¶
Called at the end of get_events() for fully custom site-specific processing. inject_known_data() has already operated on the input events.
- Parameters:
- events:
Returned events from get_events()
- Returns:
- events: List[EventIngestionModel]
Base implementation simply returns input events as-is
- resolve_person_alias(person: Person) Person | None [source]¶
If input person is in fact an alias of a reference known person, return the reference person instead. Else return person as-is.
- Parameters:
- person: Person
Person to check for being an alias rather than a real unique Person
- Returns:
- Person
input person, or the correct reference Person if input person is an alias.
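Alias resolution can be sketched on names alone. This assumes person_aliases maps each reference name to the set of its known aliases and that matching is by exact string; the helper name `resolve_alias` and the sample names are hypothetical.

```python
def resolve_alias(name, person_aliases):
    """Return the reference name if name is a known alias, else name as-is."""
    for reference_name, aliases in person_aliases.items():
        if name in aliases:
            return reference_name
    return name
```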
- use_or_replace_role(role_title: str) str [source]¶
Lookup if the provided role title should be replaced with a CDP standard value. If the provided role title should be replaced, then return the proper replacement title, otherwise if the title wasn’t found in the role replacement lookup table, return the provided role_title unchanged.
- Parameters:
- role_title: str
The role title to check and potentially replace with a CDP standard.
- Returns:
- role_title: str
The original role title if no replacement was found in the role replacements lookup-table, or the CDP standard title swapped from the lookup-table.
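The lookup described above amounts to a dictionary fetch with a fallback. A minimal sketch, assuming role_replacements is a plain dict keyed by the title to replace; the function name `use_or_replace` and the sample mapping are illustrative only.

```python
def use_or_replace(role_title, role_replacements):
    """Return the replacement for role_title, or role_title if none exists."""
    return role_replacements.get(role_title, role_title)
```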
- cdp_scrapers.legistar_utils.get_legistar_body(client: str, body_id: int, use_cache: bool = False) dict[str, Any] | None [source]¶
Return information for a single legistar body in JSON.
- Parameters:
- client: str
Which legistar client to target. Ex: “seattle”
- body_id: int
Unique ID for this body in the legistar municipality
- use_cache: bool
True: Store result to prevent querying repeatedly for same body_id
- Returns:
- body: Dict[str, Any]
legistar API body
known_legistar_bodies cache is cleared for every LegistarScraper.get_events() call
- cdp_scrapers.legistar_utils.get_legistar_content_uris(client: str, legistar_ev: dict) ContentUriScrapeResult [source]¶
Return URLs for videos and captions from a Legistar/Granicus-hosted video web page.
- Parameters:
- client: str
Which legistar client to target. Ex: “seattle”
- legistar_ev: Dict
Data for one Legistar Event.
- Returns:
- ContentUriScrapeResult
- status: ContentUriScrapeResult.Status
Status code describing the scraping process. Use uris only if status is Ok
- uris: Optional[List[ContentURIs]]
URIs for video and optional caption
- Raises:
- NotImplementedError
Means the content structure of the web page hosting session video has changed. The scraping code needs explicit review and update.
- ConnectionError
When the Legistar site itself may be down.
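Because uris should be used only when status is Ok, a caller might guard access as sketched below. `Status` and `ScrapeResult` are stand-ins that reuse the member values and field names documented for ContentUriScrapeResult so the example runs without cdp_scrapers installed; `usable_uris` is a hypothetical helper.

```python
from enum import IntEnum
from typing import NamedTuple, Optional


class Status(IntEnum):
    """Stand-in mirroring the documented ContentUriScrapeResult.Status values."""

    ContentNotProvidedError = -3
    ResourceAccessError = -2
    UnrecognizedPatternError = -1
    Ok = 0


class ScrapeResult(NamedTuple):
    """Stand-in with the same two fields as ContentUriScrapeResult."""

    status: Status
    uris: Optional[list] = None


def usable_uris(result):
    """Return result.uris only when the scrape succeeded, else an empty list."""
    if result.status == Status.Ok and result.uris:
        return result.uris
    return []
```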
- cdp_scrapers.legistar_utils.get_legistar_events_for_timespan(client: str, begin: datetime | None = None, end: datetime | None = None) list[dict] [source]¶
Get all legistar events and each event’s minutes items, people, and votes, for a client for a given timespan.
- Parameters:
- client: str
Which legistar client to target. Ex: “seattle”
- begin: Optional[datetime]
The timespan beginning datetime to query for events after. Default: UTC now - 1 day
- end: Optional[datetime]
The timespan end datetime to query for events before. Default: UTC now
- Returns:
- events: List[Dict]
All legistar events that occur between the datetimes provided for the client provided. Additionally, requests and attaches agenda items, minutes items, any attachments, called “EventItems”, requests votes for any of these “EventItems”, and requests person information for any vote.
- cdp_scrapers.legistar_utils.get_legistar_person(client: str, person_id: int, use_cache: bool = False) dict[str, Any] | None [source]¶
Return information for a single legistar person in JSON.
- Parameters:
- client: str
Which legistar client to target. Ex: “seattle”
- person_id: int
Unique ID for this person in the legistar municipality
- use_cache: bool
True: Store result to prevent querying repeatedly for same person_id
- Returns:
- person: Dict[str, Any]
legistar API person
known_legistar_persons cache is cleared for every LegistarScraper.get_events() call
- cdp_scrapers.legistar_utils.parse_video_page_url(video_page_url: str, client: str) list[ContentURIs] [source]¶
Return URLs for videos and captions from a Legistar/Granicus-hosted video web page.
- Parameters:
- video_page_url: str
The URL for the page of the legistar video
- client: str
Which legistar client to target. Ex: “seattle”
- Returns:
- uris: Optional[List[ContentURIs]]
URIs for video and optional caption
cdp_scrapers.prime_gov_utils module¶
- class cdp_scrapers.prime_gov_utils.PrimeGovScraper(client_id: str, timezone: str, matter_adopted_pattern: str = 'approved|confirmed|passed|adopted|consent|(?:voted.*com+it+ee)', matter_in_progress_pattern: str = 'heard|read|filed|held|(?:in.*com+it+ee)', matter_rejected_pattern: str = 'rejected|dropped', person_aliases: Dict[str, Set[str]] | None = None)[source]¶
Adapter for civic_scraper PrimeGovSite in cdp-scrapers.
- Parameters:
- client_id: str
primegov api instance id, e.g. lacity for Los Angeles, CA
- timezone: str
Local time zone
- matter_adopted_pattern: str
Regex pattern used to convert matter was adopted to CDP constant value. Default: “approved|confirmed|passed|adopted|consent|(?:voted.*com+it+ee)”
- matter_in_progress_pattern: str
Regex pattern used to convert matter is in-progress to CDP constant value. Default: “heard|read|filed|held|(?:in.*com+it+ee)”
- matter_rejected_pattern: str
Regex pattern used to convert matter was rejected to CDP constant value. Default: “rejected|dropped”
- person_aliases: Optional[Dict[str, Set[str]]] = None
Dictionary used to catch name aliases and resolve improperly different Persons to the one correct Person.
- get_body(meeting: Dict[str, Any]) Body | None [source]¶
Extract a Body from a primegov meeting dictionary.
- Parameters:
- meeting: Meeting
Target meeting
- Returns:
- Optional[Body]
Body extracted from the meeting
- get_event(meeting: Dict[str, Any]) EventIngestionModel | None [source]¶
Extract an EventIngestionModel from a primegov meeting dictionary.
- Parameters:
- meeting: Meeting
Target meeting
- Returns:
- Optional[EventIngestionModel]
EventIngestionModel extracted from the meeting
- get_event_minutes_item(minutes_table: Tag) EventMinutesItem | None [source]¶
Extract event minutes item info from a minutes item <table> on agenda web page.
- Parameters:
- minutes_table: Tag
<table> tag on agenda web page for a minutes item.
- Returns:
- EventMinutesItem
Container object with matter, minutes item
- get_event_minutes_items(meeting: Dict[str, Any]) List[EventMinutesItem] | None [source]¶
First find a web page for the given meeting’s agenda. Then scrape minutes items.
- Parameters:
- meeting: Meeting
Target meeting
- Returns:
- Optional[List[EventMinutesItem]]
Event minutes items scraped from the meeting agenda web page.
- get_events(begin: datetime | None = None, end: datetime | None = None) List[EventIngestionModel] [source]¶
Return list of ingested events for the given time period.
- Parameters:
- begin: Optional[datetime]
The timespan beginning datetime to query for events after. Default is 2 days before UTC now
- end: Optional[datetime]
The timespan end datetime to query for events before. Default is UTC now
- Returns:
- events: List[EventIngestionModel]
One instance of EventIngestionModel per primegov api meeting
- get_matter(minutes_table: Tag, minutes_item: MinutesItem | None = None) Matter | None [source]¶
Extract matter info from a minutes item <table> on agenda web page.
- Parameters:
- minutes_table: Tag
<table> tag on agenda web page for a minutes item.
- minutes_item: Optional[MinutesItem] = None
Associated minutes item that will be used to fill in some info.
- Returns:
- Matter
A Matter instance associated with a minutes item.
self.matter_status_pattern_map is used to standardize result_status to one of the CDP ingestion model constants.
- get_meetings(begin: datetime, end: datetime) Iterator[Dict[str, Any]] [source]¶
Query meetings from primegov api endpoint.
- Parameters:
- begin: datetime
The timespan beginning datetime to query for events after.
- end: datetime
The timespan end datetime to query for events before.
- Returns:
- Optional[Iterator[Meeting]]
Iterator over list of meeting JSON
Because of CDP’s preference for videos, meetings without video URL are filtered out.
- cdp_scrapers.prime_gov_utils.get_matter(minutes_table: Tag, minutes_item: MinutesItem | None = None) Matter | None [source]¶
Extract matter info from a minutes item <table>.
- Parameters:
- minutes_table: Tag
<table> for a minutes item on agenda web page
- minutes_item: Optional[MinutesItem] = None
Associated minutes item that will be used to fill in some info. e.g. matter title is taken from it if available.
- Returns:
- Matter
A Matter instance associated with a minutes item.
Only basic string clean-up is applied, e.g. simplifying whitespace. The caller is expected to clean up the data as appropriate.
- cdp_scrapers.prime_gov_utils.get_minutes_item(minutes_table: Tag) MinutesItem [source]¶
Extract minutes item name and description.
- Parameters:
- minutes_table: Tag
<table> for a minutes item on agenda web page
- Returns:
- MinutesItem
Minutes item name and description
- Raises:
- ValueError
If the <table> HTML structure is not as expected
- cdp_scrapers.prime_gov_utils.get_minutes_tables(agenda: BeautifulSoup) Iterator[Tag] [source]¶
Return iterator over tables for minutes items.
- Parameters:
- agenda: Agenda
Agenda web page loaded into BeautifulSoup
- Returns:
- Iterator[Tag]
List of <table> for minutes items
- cdp_scrapers.prime_gov_utils.get_support_files(minutes_table: Tag) Iterator[SupportingFile] [source]¶
Extract the minutes item’s support file URLs.
- Parameters:
- minutes_table: Tag
<table> for a minutes item on agenda web page
- Returns:
- Iterator[SupportingFile]
List of support file information for the input minutes item
- Raises:
- ValueError
If the <table> HTML structure is not as expected
- cdp_scrapers.prime_gov_utils.get_support_files_div(minutes_table: Tag) Tag [source]¶
Find the <div> containing a minutes item’s support document URLs.
- Parameters:
- minutes_table: Tag
<table> for a minutes item on agenda web page
- Returns:
- Tag
<div> with support documents for the minutes item
- cdp_scrapers.prime_gov_utils.load_agenda(url: str) BeautifulSoup | None [source]¶
Load the agenda web page.
- Parameters:
- url: str
Agenda web page URL
- Returns:
- Optional[Agenda]
Agenda web page loaded into BeautifulSoup
cdp_scrapers.scraper_utils module¶
- class cdp_scrapers.scraper_utils.IngestionModelScraper(timezone: str, person_aliases: dict[str, set[str]] | None = None)[source]¶
Base class for events scrapers providing IngestionModels for cdp-backend pipeline.
- Parameters:
- timezone: str
The timezone for the target client, e.g. “America/Los_Angeles” or “America/New_York”. Use a canonical tz database name.
- person_aliases: Optional[Dict[str, Set[str]]]
Dictionary used to catch name aliases and resolve improperly different Persons to the one correct Person. Default: None
- static find_time_zone() str [source]¶
Return name for a US time zone matching UTC offset calculated from OS clock.
- get_none_if_empty(model: IngestionModel) IngestionModel | None [source]¶
Check required keys in model and return None if any such key has no value. If all required keys have valid values, return the model as-is.
- Parameters:
- model: IngestionModel
Person, MinutesItem, etc.
- Returns:
- model: Optional[IngestionModel]
None or model as-is
- static get_required_attrs(model: IngestionModel) list[str] [source]¶
Return list of keys required in model as specified in IngestionModel class definition.
- Parameters:
- model: IngestionModel
Person, MinutesItem, etc.
- Returns:
- attr_keys: List[str]
List of keys (attributes) in model without default value in class definition.
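The two helpers above can be sketched together, under the assumption that ingestion models are dataclasses whose required attributes are the fields declared without a default. `ExamplePerson` is a hypothetical model, and treating any falsy value as “no value” is a simplification.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Optional


@dataclass
class ExamplePerson:
    """Hypothetical model: name is required, email is optional."""

    name: str
    email: Optional[str] = None


def required_attrs(model):
    """List field names that have neither a default nor a default factory."""
    return [
        f.name
        for f in fields(model)
        if f.default is MISSING and f.default_factory is MISSING
    ]


def none_if_empty(model):
    """Return None if any required attribute is empty, else the model as-is."""
    for attr in required_attrs(model):
        if not getattr(model, attr):
            return None
    return model
```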
- handle_old_new_council(old_names: list[str], new_names: list[str]) None [source]¶
Override to handle old and new councilmember information.
- Parameters:
- old_names: list[str]
e.g. from scraper_utils.compare_persons
- new_names: list[str]
e.g. from scraper_utils.compare_persons
Base implementation simply logs
- localize_datetime(local_time: datetime) datetime [source]¶
Return input datetime with time zone information. This allows for nonambiguous conversions to other zones including UTC.
- Parameters:
- local_time: datetime
The datetime to attach timezone information to.
- Returns:
- local_time: datetime
The date and time attributes (year, month, day, hour, …) remain unchanged. tzinfo is now provided.
- resolve_person_alias(person: Person) Person [source]¶
If input person is in fact an alias of a reference known person, return the reference person instead. Else return person as-is.
- Parameters:
- person: Person
Person to check for being an alias rather than a real unique Person
- Returns:
- Person
input person, or the correct reference Person if input person is an alias. This base implementation always returns person as-is.
- cdp_scrapers.scraper_utils.compare_persons(scraped_persons, known_persons, primary_bodies) PersonsComparison [source]¶
Look for old and new councilmembers.
- Parameters:
- scraped_persons: list[Person]
e.g. from extract_persons
- known_persons: list[Person]
e.g. from ScraperStaticData
- primary_bodies: list[Body]
e.g. from ScraperStaticData
- Returns:
- PersonsComparison
Old and new councilmember names
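A heavily simplified sketch of the comparison idea, working on names only. It ignores primary_bodies and any role or term checks the real function may perform; `Comparison` mirrors the documented PersonsComparison fields, and `compare_names` is a hypothetical helper.

```python
from typing import List, NamedTuple


class Comparison(NamedTuple):
    """Stand-in with the same fields as PersonsComparison."""

    old_names: List[str]
    new_names: List[str]


def compare_names(scraped_names, known_names):
    """Names only in known_names are old; names only in scraped_names are new."""
    scraped, known = set(scraped_names), set(known_names)
    return Comparison(
        old_names=sorted(known - scraped),
        new_names=sorted(scraped - known),
    )
```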
- cdp_scrapers.scraper_utils.extract_persons(events)[source]¶
Get all sponsors and voters across all events.
- Parameters:
- events: list[EventIngestionModel]
Scraped events
- Returns:
- list[Person]
Unique list of all sponsors and voters found
- cdp_scrapers.scraper_utils.parse_static_file(file_path: Path, timezone: str) ScraperStaticData [source]¶
Parse Seats, Bodies and Persons from static data JSON.
- Parameters:
- file_path: Path
Path to file containing static data in JSON
- timezone: str
The timezone for the target client, e.g. “America/Los_Angeles” or “America/New_York”. Use a canonical tz database name.
- Returns:
- ScraperStaticData:
Tuple[Dict[str, Seat], Dict[str, Body], Dict[str, Person]]
Function looks for “seats”, “primary_bodies”, “persons” top-level keys
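A skeleton of a static data file with the three top-level keys named above, plus a small check for missing keys. The empty objects are placeholders: the layout of the entries under each key is not shown here, and `missing_top_level_keys` is a hypothetical helper rather than part of the package.

```python
import json

# Skeleton only; entries under each key are intentionally left out.
STATIC_JSON = """
{
  "seats": {},
  "primary_bodies": {},
  "persons": {}
}
"""

EXPECTED_KEYS = ("seats", "primary_bodies", "persons")


def missing_top_level_keys(text):
    """Return the expected top-level keys that are absent from the JSON text."""
    data = json.loads(text)
    return [key for key in EXPECTED_KEYS if key not in data]
```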
- cdp_scrapers.scraper_utils.parse_static_person(person_json: dict[str, Any], all_seats: dict[str, Seat], primary_bodies: dict[str, Body], timezone: timezone) Person [source]¶
Parse Dict[str, Any] for a person in static data file to a Person instance. person_json[“seat”] and person_json[“roles”] are validated against all_seats and primary_bodies in static data file.
- Parameters:
- person_json: Dict[str, Any]
A dictionary in static data file with info for a Person.
- all_seats: Dict[str, Seat]
Seats defined as top-level in static data file
- primary_bodies: Dict[str, Body]
Bodies defined as top-level in static data file.
- timezone: str
The timezone for the target client, e.g. “America/Los_Angeles” or “America/New_York”. Use a canonical tz database name.
- cdp_scrapers.scraper_utils.reduced_list(input_list: list[Any], collapse: bool = True) list | None [source]¶
Remove all None items from input_list.
- Parameters:
- input_list: List[Any]
Input list from which to filter out items that are None
- collapse: bool, default = True
If True, return None in place of an empty list
- Returns:
- reduced_list: Optional[List]
All items in the original list except for None values. None if all items were None and collapse is True.
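The documented behavior can be written in a few lines. This is a sketch named `reduce_list` to keep it distinct from the library function; it follows the Parameters and Returns text above rather than the actual source.

```python
def reduce_list(input_list, collapse=True):
    """Drop None items; with collapse=True, return None for an empty result."""
    reduced = [item for item in input_list if item is not None]
    if collapse and not reduced:
        return None
    return reduced
```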
- cdp_scrapers.scraper_utils.sanitize_roles(person_name: str, roles: list[Role] | None = None, static_data: ScraperStaticData | None = None, council_pres_patterns: list[str] | None = None, chair_patterns: list[str] | None = None) list[Role] | None [source]¶
Standardize roles[i].title to RoleTitle constants
Ensure only 1 councilmember Role per term.
- Parameters:
- person_name: str
Sanitization target
- roles: Optional[List[Role]] = None
target Person’s Roles to sanitize
- static_data: Optional[ScraperStaticData]
Static data defining primary council bodies and predefined See Notes.
- council_pres_patterns: List[str]
Set roles[i].title as “Council President” if match and roles[i].body is a primary body like City Council
- chair_patterns: List[str]
Set roles[i].title as “Chair” if match and roles[i].body is not a primary body
Remove roles[#] if roles[#].body in static_data.primary_bodies. Use static_data.persons[#].seat.roles instead.
If roles[i].body not in static_data.primary_bodies, roles[i].title cannot be “Councilmember” or “Council President”.
Use “City Council” and “Council Briefing” if static_data.primary_bodies is empty.
cdp_scrapers.types module¶
- class cdp_scrapers.types.ContentURIs(video_uri, caption_uri)[source]¶
Create new instance of ContentURIs(video_uri, caption_uri)
- caption_uri: str | None¶
Alias for field number 1
- video_uri: str | None¶
Alias for field number 0
- cdp_scrapers.types.LegistarContentParser¶
Function that returns URLs for videos and captions from a Legistar/Granicus-hosted video web page
- Parameters:
- client: str
Which legistar client to target. Ex: “seattle”
- video web page: BeautifulSoup
Video web page loaded into bs4
- Returns:
- uris: Optional[List[ContentURIs]]
URIs for video and optional caption
alias of a callable taking (client: str, video web page: BeautifulSoup) and returning Optional[List[ContentURIs]]
- class cdp_scrapers.types.PersonsComparison(old_names, new_names)[source]¶
Create new instance of PersonsComparison(old_names, new_names)
- new_names: List[str]¶
Alias for field number 1
- old_names: List[str]¶
Alias for field number 0
- class cdp_scrapers.types.ScraperStaticData(seats, primary_bodies, persons)[source]¶
Create new instance of ScraperStaticData(seats, primary_bodies, persons)
- persons: Dict[str, Person]¶
Alias for field number 2
- primary_bodies: Dict[str, Body]¶
Alias for field number 1
- seats: Dict[str, Seat]¶
Alias for field number 0
cdp_scrapers.youtube_utils module¶
- class cdp_scrapers.youtube_utils.YoutubeIngestionScraper(channel_name: str, body_search_terms: Dict[str, str], **kwargs: Any)[source]¶
Base class for scraping CDP event ingestion models from YouTube videos.
- Parameters:
- channel_name: str
YouTube channel name where the municipality meeting videos are hosted
- body_search_terms: Dict[str, str]
e.g. {“City Council”: “city council meeting”}
- kwargs: Any
Passed to base class constructor
- get_events(begin: datetime | None = None, end: datetime | None = None) List[EventIngestionModel] [source]¶
Scrape CDP events from the meeting videos hosted on this municipality YouTube channel.
- Parameters:
- begin: Optional[datetime]
The timespan beginning datetime to query for events after. Default is 2 days before UTC now
- end: Optional[datetime]
The timespan end datetime to query for events before. Default is UTC now
- Returns:
- events: List[EventIngestionModel]
One instance of EventIngestionModel per event scraped from the YouTube channel
- get_session(video_info: Dict[str, Any]) Session | None [source]¶
Parse a CDP Session from YouTube video information.
- Parameters:
- video_info: Dict[str, Any]
YouTube video information from yt-dlp
- Returns:
- Optional[Session]
None if required information is missing
- iter_events(begin: datetime, end: datetime) Iterator[EventIngestionModel] [source]¶
Return iterator over events from given date range, for all known bodies in this municipality.
- Parameters:
- begin: datetime
The timespan beginning datetime to query for events after.
- end: datetime
The timespan end datetime to query for events before.
- Yields:
- EventIngestionModel
If multiple videos are found for a given body on the same day, they are treated as sessions of the same event.
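The grouping rule can be illustrated with plain tuples. In this sketch each video is a hypothetical (body, day, url) triple, and one entry per body and day stands in for an event whose list of URLs stands in for its sessions; the real scraper works with yt-dlp video information and CDP models instead.

```python
from collections import defaultdict
from datetime import date


def group_sessions(videos):
    """Group (body, day, url) triples into one entry per body and day."""
    events = defaultdict(list)
    for body, day, url in videos:
        events[(body, day)].append(url)
    return dict(events)
```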
- cdp_scrapers.youtube_utils.get_video_info(query_url: str) List[Dict[str, Any]] [source]¶
Return dictionaries of search hit video metadata.
- Parameters:
- query_url: str
Full YouTube URL including the query parameters
- Returns:
- List[Dict[str, Any]]
Dictionary containing information for each search hit YouTube video
- cdp_scrapers.youtube_utils.urljoin_search_query(channel_name: str, search_terms: str, begin: datetime | None = None, end: datetime | None = None) str [source]¶
Return search URL…
- Parameters:
- channel_name: str
YouTube channel hosting the videos
- search_terms: str
Search terms, e.g. “city council meeting”
- begin: Optional[datetime]
The timespan beginning datetime to query for events after.
- end: Optional[datetime]
The timespan end datetime to query for events before.
- Returns:
- str
Full HTTPS URL for searching channel videos e.g.…
- Raises:
- ValueError
If both begin and end are None
If search term + date range is empty
Module contents¶
Top-level package for cdp_scrapers.