cdp_scrapers package¶

Subpackages¶

cdp_scrapers.instances package

Submodules¶

cdp_scrapers.legistar_content_parsers module¶

cdp_scrapers.legistar_utils module¶

class cdp_scrapers.legistar_utils.ContentUriScrapeResult(status, uris)[source]¶

Bases: NamedTuple

Create new instance of ContentUriScrapeResult(status, uris)

class Status(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶

Bases: IntEnum

Status of content parsing.

ContentNotProvidedError = -3¶

Ok = 0¶

ResourceAccessError = -2¶

UnrecognizedPatternError = -1¶

status: Status¶: Alias for field number 0

uris: list[ContentURIs] | None¶: Alias for field number 1
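A minimal sketch of how a caller might consume a ContentUriScrapeResult. The classes below are stand-ins re-declared from the fields and status values listed above so that the snippet runs without cdp_scrapers installed (Status is flattened to module level here), and first_video_uri is a hypothetical helper, not part of the package.

```python
from enum import IntEnum
from typing import NamedTuple, Optional


class ContentURIs(NamedTuple):
    # Stand-in mirroring the fields of cdp_scrapers.types.ContentURIs
    video_uri: Optional[str]
    caption_uri: Optional[str] = None


class Status(IntEnum):
    # Same names and values as ContentUriScrapeResult.Status
    ContentNotProvidedError = -3
    ResourceAccessError = -2
    UnrecognizedPatternError = -1
    Ok = 0


class ContentUriScrapeResult(NamedTuple):
    status: Status
    uris: Optional[list] = None


def first_video_uri(result: ContentUriScrapeResult) -> Optional[str]:
    """Hypothetical helper: read uris only when status is Ok."""
    if result.status != Status.Ok or not result.uris:
        return None
    return result.uris[0].video_uri


ok = ContentUriScrapeResult(Status.Ok, [ContentURIs("https://example.com/video.mp4")])
failed = ContentUriScrapeResult(Status.ResourceAccessError)
```

Because the result is a NamedTuple, it can also be unpacked positionally as (status, uris).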

Bases: IngestionModelScraper

Base class for transforming Legistar API data to CDP IngestionModel.

If get_events() naively fails and raises an error, a given installation must define a derived class and implement the get_content_uris() function.

Parameters:

client: str: Legistar client name, e.g. “seattle” for Seattle, “kingcounty” for King County.
timezone: str: The timezone for the target client, e.g. “America/Los_Angeles” or “America/New_York”. See https://en.wikipedia.org/wiki/List_of_tz_database_time_zones for canonical timezones.
ignore_minutes_item_patterns: List[str]: A list of string patterns or substrings to act as a minutes item filter. Each item in the provided list is compiled as a regex, and any minutes item that contains a compiled pattern is filtered out of the produced CDP minutes item list. Default: [] (do not filter any minutes items)
vote_approve_pattern: str: Regex pattern used to convert Legistar instance’s votes in approval value to CDP constant value. Default: “approve|favor|yes”
vote_abstain_pattern: str: Regex pattern used to convert Legistar instance’s abstention value to CDP constant value. Note, this is a pure abstention, not an “approval by abstention” or “rejection by abstention” value. Those should be placed in vote_approve_pattern and vote_reject_pattern respectively. Default: “abstain|refuse|refrain”
vote_reject_pattern: str: Regex pattern used to convert Legistar instance’s votes in rejection value to CDP constant value. Default: “reject|oppose|no”
vote_absent_pattern: str: Regex pattern used to convert Legistar instance’s excused absence value to CDP constant value. Default: “absent”
vote_nonvoting_pattern: str: Regex pattern used to convert Legistar instance’s non-voting value to CDP constant value. Default: “nv|(?:non.*voting)”
matter_adopted_pattern: str: Regex pattern used to convert Legistar instance’s matter was adopted to CDP constant value. Default: “approved|confirmed|passed|adopted”
matter_in_progress_pattern: str: Regex pattern used to convert Legistar instance’s matter is in-progress to CDP constant value. Default: “heard|ready|filed|held|(?:in\s*committee)”
matter_rejected_pattern: str: Regex pattern used to convert Legistar instance’s matter was rejected to CDP constant value. Default: “rejected|dropped”
minutes_item_decision_passed_pattern: str: Regex pattern used to convert Legistar instance’s minutes item passage to CDP constant value. Default: “pass”
minutes_item_decision_failed_pattern: str: Regex pattern used to convert Legistar instance’s minutes item failure to CDP constant value. Default: “not|fail”
static_data: Optional[ScraperStaticData]: predefined Seats, Bodies and Persons used to provide more accurate Person.seat.
person_aliases: Optional[Dict[str, Set[str]]]: Dictionary used to catch name aliases and resolve improperly unique Persons to the one correct Person. Default: None
role_replacements: Optional[Dict[str, str]]: Dictionary used to replace role titles with CDP standard role titles. The keys should be titles you want to replace and the values should be a CDP standard role. Default: None

See also

cdp_scrapers.legistar_utils.LegistarScraper.get_content_uris
cdp_scrapers.instances.seattle.SeattleScraper

check_for_cdp_min_ingestion(check_days: int = 7) → bool[source]¶

Test whether at least one minimally defined EventIngestionModel can be obtained.

Parameters:

check_days: int, default=7: Test duration is the past check_days days from now

Returns:

minimum_ingestion_data_available: bool: True if got at least one minimally defined EventIngestionModel

static date_and_time_to_datetime(ev_date: str, ev_time: str | None) → datetime[source]¶

Return datetime from ev_date and ev_time.

Parameters:

ev_date: str: Formatted as “%Y-%m-%dT%H:%M:%S”
ev_time: Optional[str]: Formatted as “%I:%M %p”, or None to leave the time from ev_date unchanged.

Returns:

datetime: date using ev_date and time using ev_time
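The two formats above can be combined roughly as follows. This is a sketch under the documented formats only; combine_date_and_time is a hypothetical name, and the real method may handle malformed input differently.

```python
from datetime import datetime
from typing import Optional


def combine_date_and_time(ev_date: str, ev_time: Optional[str]) -> datetime:
    """Sketch: parse the date, then overlay hour and minute if a time is given."""
    date = datetime.strptime(ev_date, "%Y-%m-%dT%H:%M:%S")
    if ev_time is None:
        # No time supplied: return the parsed date untouched.
        return date
    time = datetime.strptime(ev_time, "%I:%M %p")
    return date.replace(hour=time.hour, minute=time.minute)


with_time = combine_date_and_time("2021-03-15T00:00:00", "2:30 PM")
date_only = combine_date_and_time("2021-03-15T00:00:00", None)
```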

filter_event_minutes(ev_minutes_item: EventMinutesItem) → EventMinutesItem | None[source]¶

Return None if minutes_item.name contains unimportant text that we want to ignore.

Parameters:

ev_minutes_item: EventMinutesItem: The minutes item to filter.

Returns:

filtered_event_minutes_items: Optional[EventMinutesItem]: The allowed minutes item, or None if it was filtered out.
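As an illustration of the filtering idea, the sketch below applies a list of ignore patterns to a minutes item name. It works on plain strings rather than EventMinutesItem objects; filter_minutes_item_name is a hypothetical helper, and the case-insensitive matching is an assumption of this sketch.

```python
import re
from typing import List, Optional


def filter_minutes_item_name(name: str, ignore_patterns: List[str]) -> Optional[str]:
    """Sketch: return None when any ignore pattern is found in the name."""
    for pattern in ignore_patterns:
        # Case-insensitive search is an assumption of this sketch.
        if re.search(pattern, name, re.IGNORECASE):
            return None
    return name


patterns = ["roll call", r"^adjourn"]
kept = filter_minutes_item_name("CB 12345: An ordinance relating to parks", patterns)
dropped = filter_minutes_item_name("Roll Call", patterns)
```

With an empty pattern list nothing is filtered, which matches the documented default for ignore_minutes_item_patterns.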

fix_event_minutes(ev_minutes_item: EventMinutesItem | None, legistar_ev_item: dict) → EventMinutesItem | None[source]¶

Inspect the MinutesItem and Matter in ev_minutes_item.
- Move some fields between them to make the information more meaningful.
- Enforce matter.result_status when appropriate.

Parameters:

ev_minutes_item: Optional[EventMinutesItem]: The specific event minutes item to clean. May be None, e.g. when this function runs in a loop over multiple event minutes items and the item was already filtered out, in which case there is nothing to clean.
legistar_ev_item: Dict: The original Legistar EventItem.

Returns:

cleaned_emi: Optional[EventMinutesItem]: The cleaned event minutes item. This can clean both the event minutes item and the attached matter information.

get_body(legistar_body: dict[str, Any]) → Body | None[source]¶

Return CDP Body for Legistar body.

Parameters:

legistar_body: Dict: Legistar API body

Returns:

body: Optional[Body]: The Legistar body converted to a CDP body ingestion model. None if missing required information.

See also

get_legistar_body

get_content_uris(legistar_ev: dict) → list[ContentURIs][source]¶

Must implement in class derived from LegistarScraper. If Legistar Event.EventVideoPath is used, return an empty list in the override.

Parameters:

legistar_ev: Dict: Data for one Legistar Event.

Returns:

event_content_uris: List[ContentURIs]: List of ContentURIs objects for each session found.

Raises:

NotImplementedError: This base implementation does nothing

See also

cdp_scrapers.legistar_utils.get_legistar_events_for_timespan
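The override contract described above can be sketched with stand-in classes. BaseScraperStandIn and ExampleScraper are hypothetical names used so that the snippet runs without cdp_scrapers installed; a real installation would derive from LegistarScraper instead.

```python
from typing import List, NamedTuple, Optional


class ContentURIs(NamedTuple):
    # Stand-in mirroring the fields of cdp_scrapers.types.ContentURIs
    video_uri: Optional[str]
    caption_uri: Optional[str] = None


class BaseScraperStandIn:
    """Stand-in for LegistarScraper; only the overridable hook is modeled."""

    def get_content_uris(self, legistar_ev: dict) -> List[ContentURIs]:
        raise NotImplementedError("Override in a derived class")


class ExampleScraper(BaseScraperStandIn):
    """Hypothetical installation whose events already carry EventVideoPath."""

    def get_content_uris(self, legistar_ev: dict) -> List[ContentURIs]:
        # Per the description above, return an empty list when
        # Legistar Event.EventVideoPath is used.
        return []


uris = ExampleScraper().get_content_uris({"EventVideoPath": "https://example.com/v.mp4"})
```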

get_event_minutes(legistar_ev_items: list[dict]) → list[EventMinutesItem] | None[source]¶

Return List[EventMinutesItem] for Legistar API EventItems.

Parameters:

legistar_ev_items: List[Dict]: Legistar API EventItems

Returns:

event_minutes_items: Optional[List[EventMinutesItem]]: Filtered set of event minutes items.

get_event_supporting_files(legistar_ev_attachments: list[dict]) → list[SupportingFile] | None[source]¶

Return List[SupportingFile] for Legistar API MatterAttachments.

Parameters:

legistar_ev_attachments: List[Dict]: Legistar API MatterAttachments

Returns:

files: Optional[List[SupportingFile]]: List of supporting files if provided. None if empty list or missing information.

get_events(begin: datetime | None = None, end: datetime | None = None) → list[EventIngestionModel][source]¶

Calls get_legistar_events_for_timespan to retrieve Legistar API data and return as List[EventIngestionModel].

Parameters:

begin: datetime, optional: The timespan beginning datetime to query for events after. Default is 2 days before UTC now
end: datetime, optional: The timespan end datetime to query for events before. Default is UTC now

Returns:

events: List[EventIngestionModel]: One instance of EventIngestionModel per Legistar Event

See also

cdp_scrapers.legistar_utils.get_legistar_events_for_timespan

get_matter(legistar_ev: dict) → Matter | None[source]¶

Return Matter from Legistar API EventItem.

Parameters:

legistar_ev: Dict: Legistar API EventItem

Returns:

matter: Optional[Matter]: The Legistar matter details converted to a CDP Matter object. None if missing information.

get_matter_status(legistar_matter_status: str) → str | None[source]¶

Return appropriate MatterStatusDecision constant from EventItemMatterStatus.

Parameters:

legistar_matter_status: str: Legistar API EventItemMatterStatus.

Returns:

matter_status: Optional[str]: A constant from CDP allowed matter status decisions. None if missing information or if none of the matter status decision parameter patterns match the Legistar matter status value.

See also

cdp_backend.database.constants.MatterStatusDecision
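A rough sketch of pattern-based status matching, using the default patterns listed for the scraper parameters (the in-progress pattern is written as in\s*committee). The three constant strings are placeholders standing in for the MatterStatusDecision values, and match_matter_status, the case-insensitive search, and the order of checks are all assumptions of this sketch.

```python
import re
from typing import Optional

# Placeholder strings standing in for the MatterStatusDecision constants.
ADOPTED = "Adopted"
IN_PROGRESS = "In Progress"
REJECTED = "Rejected"

# Default patterns as listed for the scraper parameters.
STATUS_PATTERNS = {
    ADOPTED: r"approved|confirmed|passed|adopted",
    IN_PROGRESS: r"heard|ready|filed|held|(?:in\s*committee)",
    REJECTED: r"rejected|dropped",
}


def match_matter_status(legistar_matter_status: Optional[str]) -> Optional[str]:
    """Sketch: return the first decision whose pattern is found, else None."""
    if not legistar_matter_status:
        return None
    for decision, pattern in STATUS_PATTERNS.items():
        if re.search(pattern, legistar_matter_status, re.IGNORECASE):
            return decision
    return None
```

An installation whose Legistar instance uses other wording would pass its own patterns to the scraper rather than change the matching logic.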

get_minutes_item(legistar_ev_item: dict) → MinutesItem | None[source]¶

Return MinutesItem from parts of Legistar API EventItem.

Parameters:

legistar_ev_item: Dict: Legistar API EventItem

Returns:

minutes_item: Optional[MinutesItem]: None if could not get nonempty MinutesItem.name from EventItem.

get_minutes_item_decision(legistar_item_passed_name: str) → str | None[source]¶

Return appropriate EventMinutesItemDecision constant from EventItemPassedFlagName.

Parameters:

legistar_item_passed_name: str: Legistar API EventItemPassedFlagName

Returns:

emi_decision: Optional[str]: A constant from CDP allowed minutes item decisions. None if missing information or if minutes item decision parameter patterns are not inclusive of the Legistar minutes item decision value.

See also

cdp_backend.database.constants.EventMinutesItemDecision

get_person(legistar_person: dict) → Person | None[source]¶

Return CDP Person for Legistar Person.

Parameters:

legistar_person: Dict: Legistar API Person

Returns:

person: Optional[Person]: The Legistar Person converted to a CDP person ingestion model. None if missing information.

See also

get_legistar_person

get_roles(legistar_office_records: list[dict[str, Any]]) → list[Role] | None[source]¶

Return list of CDP Role from list of legistar OfficeRecord.

Parameters:

legistar_office_records: List[Dict]: Legistar API OfficeRecords

Returns:

roles: Optional[List[Role]]: From Legistar OfficeRecords. None if missing information.

get_sponsors(legistar_sponsors: list[dict]) → list[Person] | None[source]¶: Get legislation sponsors.

get_vote_decision(legistar_vote: dict) → str | None[source]¶

Return appropriate VoteDecision constant based on Legistar Vote.

Parameters:

legistar_vote: Dict: Legistar API Vote

Returns:

vote_decision: Optional[str]: A constant from CDP allowed vote decisions. None if missing vote information or if vote decision parameter patterns are not inclusive of the Legistar vote value.

See also

cdp_backend.database.constants.VoteDecision

get_votes(legistar_votes: list[dict]) → list[Vote] | None[source]¶

Return List[Vote] for Legistar API Votes.

Parameters:

legistar_votes: List[Dict]: Legistar API Votes to convert to CDP Vote ingestion models.

Returns:

votes: Optional[List[Vote]]: List of votes if any were provided. None if empty list or missing information.

inject_known_data(events: list[EventIngestionModel]) → list[EventIngestionModel][source]¶

Augment with long-term static data that changes very infrequently, e.g. self.static_data, which includes Person.picture_uri and Person.seat.

Parameters:

events: List[EventIngestionModel]: Returned events from get_events()

Returns:

events: List[EventIngestionModel]: Input events with static information possibly injected

inject_known_person(person: Person) → Person[source]¶

Inject information if person exists in static_data.persons.

Parameters:

person: Person: Person into which to inject data from static_data

Returns:

Person: Input person updated with information from static_data, and seat.roles sanitized.

See also

scraper_utils.sanitize_roles

property is_legistar_compatible: bool¶

Check that Legistar API recognizes client name.

Returns:

compatible: bool: True if client_name is a valid Legistar client name

post_process_ingestion_models(events: list[EventIngestionModel]) → list[EventIngestionModel][source]¶

Called at the end of get_events() for fully custom site-specific processing. inject_known_data() has already operated on the input events.

Parameters:

events: List[EventIngestionModel]: Returned events from get_events()

Returns:

events: List[EventIngestionModel]: Base implementation simply returns input events as-is

resolve_person_alias(person: Person) → Person | None[source]¶

If input person is in fact an alias of a reference known person, return the reference person instead. Else return person as-is.

Parameters:

person: Person: Person to check for being an alias or a real unique Person

Returns:

Person: input person, or the correct reference Person if input person is an alias.

See also

instances.seattle.person_aliases
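Alias resolution can be pictured with plain names. The sketch below assumes that each key of person_aliases is the reference name and each value is the set of aliases for it; resolve_name is a hypothetical helper that works on strings rather than Person objects, and the whitespace and case normalization is an assumption.

```python
from typing import Dict, Set


def resolve_name(name: str, person_aliases: Dict[str, Set[str]]) -> str:
    """Sketch: map an alias back to its reference name, else return name as-is."""
    normalized = name.strip().lower()
    for reference_name, aliases in person_aliases.items():
        if normalized in {alias.strip().lower() for alias in aliases}:
            return reference_name
    return name


# Hypothetical alias table for illustration only.
aliases = {"Jane Q. Example": {"Jane Example", "J. Example"}}
resolved = resolve_name("jane example", aliases)
unchanged = resolve_name("John Doe", aliases)
```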

use_or_replace_role(role_title: str) → str[source]¶

Lookup if the provided role title should be replaced with a CDP standard value. If the provided role title should be replaced, then return the proper replacement title, otherwise if the title wasn’t found in the role replacement lookup table, return the provided role_title unchanged.

Parameters:

role_title: str: The role title to check and potentially replace with a CDP standard.

Returns:

role_title: str: The original role title if no replacement was found in the role replacements lookup-table, or the CDP standard title swapped from the lookup-table.
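The lookup described above amounts to a dictionary get with the input as fallback. A minimal sketch, with a hypothetical function name and a made-up replacement table:

```python
from typing import Dict, Optional


def use_or_replace(role_title: str, role_replacements: Optional[Dict[str, str]]) -> str:
    """Sketch: return the replacement title if one is defined, else the input."""
    if not role_replacements:
        return role_title
    return role_replacements.get(role_title, role_title)


# Hypothetical table: keys are titles to replace, values are the standard titles.
replacements = {"Committee Chair": "Chair", "Council Member": "Councilmember"}
replaced = use_or_replace("Council Member", replacements)
untouched = use_or_replace("Vice Chair", replacements)
```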

cdp_scrapers.legistar_utils.get_legistar_body(client: str, body_id: int, use_cache: bool = False) → dict[str, Any] | None[source]¶

Return information for a single legistar body in JSON.

Parameters:

client: str: Which legistar client to target. Ex: “seattle”
body_id: int: Unique ID for this body in the legistar municipality
use_cache: bool: If True, store the result to prevent querying repeatedly for the same body_id

Returns:

body: Dict[str, Any]: legistar API body

Notes

known_legistar_bodies cache is cleared for every LegistarScraper.get_events() call

cdp_scrapers.legistar_utils.get_legistar_content_uris(client: str, legistar_ev: dict) → ContentUriScrapeResult[source]¶

Return URLs for videos and captions from a Legistar/Granicus-hosted video web page.

Parameters:

client: str: Which legistar client to target. Ex: “seattle”
legistar_ev: Dict: Data for one Legistar Event.

Returns:

ContentUriScrapeResult

status: ContentUriScrapeResult.Status: Status code describing the scraping process. Use uris only if status is Ok
uris: Optional[List[ContentURIs]]: URIs for video and optional caption

Raises:

NotImplementedError: Means the content structure of the web page hosting session video has changed. The scraping code needs explicit review and an update.
ConnectionError: When the Legistar site (e.g. *.legistar.com) itself may be down.

See also

LegistarScraper.get_content_uris
cdp_scrapers.legistar_content_parsers

cdp_scrapers.legistar_utils.get_legistar_events_for_timespan(client: str, begin: datetime | None = None, end: datetime | None = None) → list[dict][source]¶

Get all legistar events and each event’s minutes items, people, and votes, for a client for a given timespan.

Parameters:

client: str: Which legistar client to target. Ex: “seattle”
begin: Optional[datetime]: The timespan beginning datetime to query for events after. Default: UTC now - 1 day
end: Optional[datetime]: The timespan end datetime to query for events before. Default: UTC now

Returns:

events: List[Dict]: All legistar events that occur between the datetimes provided for the client provided. Additionally, requests and attaches agenda items, minutes items, any attachments, called “EventItems”, requests votes for any of these “EventItems”, and requests person information for any vote.
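The documented defaults for begin and end can be sketched as below. resolve_timespan and its now parameter are hypothetical (now exists only to make the example deterministic), and whether the real function uses naive or timezone-aware datetimes is not stated here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def resolve_timespan(
    begin: Optional[datetime] = None,
    end: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Sketch: fill in begin = UTC now - 1 day and end = UTC now when omitted."""
    if now is None:
        now = datetime.now(timezone.utc)
    if begin is None:
        begin = now - timedelta(days=1)
    if end is None:
        end = now
    return begin, end


fixed_now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
default_begin, default_end = resolve_timespan(now=fixed_now)
```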

cdp_scrapers.legistar_utils.get_legistar_person(client: str, person_id: int, use_cache: bool = False) → dict[str, Any] | None[source]¶

Return information for a single legistar person in JSON.

Parameters:

client: str: Which legistar client to target. Ex: “seattle”
person_id: int: Unique ID for this person in the legistar municipality
use_cache: bool: If True, store the result to prevent querying repeatedly for the same person_id

Returns:

person: Dict[str, Any]: legistar API person

Notes

known_legistar_persons cache is cleared for every LegistarScraper.get_events() call

cdp_scrapers.legistar_utils.parse_video_page_url(video_page_url: str, client: str) → list[ContentURIs][source]¶

Return URLs for videos and captions from a Legistar/Granicus-hosted video web page.

Parameters:

video_page_url: str: The URL for the page of the legistar video
client: str: Which legistar client to target. Ex: “seattle”

Returns:

uris: Optional[List[ContentURIs]]: URIs for video and optional caption

cdp_scrapers.prime_gov_utils module¶

Bases: PrimeGovSite, IngestionModelScraper

Adapter for civic_scraper PrimeGovSite in cdp-scrapers.

See also

civic_scraper.platforms.primegov.site.PrimeGovSite
cdp_scrapers.scraper_utils.IngestionModelScraper

Parameters:

client_id: str: primegov api instance id, e.g. lacity for Los Angeles, CA
timezone: str: Local time zone
matter_adopted_pattern: str: Regex pattern used to convert matter was adopted to CDP constant value. Default: “approved|confirmed|passed|adopted”
matter_in_progress_pattern: str: Regex pattern used to convert matter is in-progress to CDP constant value. Default: “heard|ready|filed|held|(?:in\s*committee)”
matter_rejected_pattern: str: Regex pattern used to convert matter was rejected to CDP constant value. Default: “rejected|dropped”
person_aliases: Optional[Dict[str, Set[str]]] = None: Dictionary used to catch name aliases and resolve improperly different Persons to the one correct Person.

get_body(meeting: Dict[str, Any]) → Body | None[source]¶

Extract a Body from a primegov meeting dictionary.

Parameters:

meeting: Meeting: Target meeting

Returns:

Optional[Body]: Body extracted from the meeting

get_event(meeting: Dict[str, Any]) → EventIngestionModel | None[source]¶

Extract an EventIngestionModel from a primegov meeting dictionary.

Parameters:

meeting: Meeting: Target meeting

Returns:

Optional[EventIngestionModel]: EventIngestionModel extracted from the meeting

See also

get_body
get_session

get_event_minutes_item(minutes_table: Tag) → EventMinutesItem | None[source]¶

Extract event minutes item info from a minutes item <table> on agenda web page.

Parameters:

minutes_table: Tag: <table> tag on agenda web page for a minutes item.

Returns:

EventMinutesItem: Container object with matter, minutes item

See also

get_matter
get_minutes_item
get_support_files

get_event_minutes_items(meeting: Dict[str, Any]) → List[EventMinutesItem] | None[source]¶

First find a web page for the given meeting’s agenda. Then scrape minutes items.

Parameters:

meeting: Meeting: Target meeting

Returns:

Optional[List[EventMinutesItem]]: Event minutes items scraped from the meeting agenda web page.

See also

get_event_minutes_item

get_events(begin: datetime | None = None, end: datetime | None = None) → List[EventIngestionModel][source]¶

Return list of ingested events for the given time period.

Parameters:

begin: Optional[datetime]: The timespan beginning datetime to query for events after. Default is 2 days before UTC now
end: Optional[datetime]: The timespan end datetime to query for events before. Default is UTC now

Returns:

events: List[EventIngestionModel]: One instance of EventIngestionModel per primegov api meeting

See also

get_meetings

get_matter(minutes_table: Tag, minutes_item: MinutesItem | None = None) → Matter | None[source]¶

Extract matter info from a minutes item <table> on agenda web page.

Parameters:

minutes_table: Tag: <table> tag on agenda web page for a minutes item.
minutes_item: Optional[MinutesItem] = None: Associated minutes item that will be used to fill in some info.

Returns:

Matter: A Matter instance associated with a minutes item.

See also

matter_status_pattern_map
get_matter

Notes

self.matter_status_pattern_map is used to standardize result_status to one of the CDP ingestion model constants.

get_meetings(begin: datetime, end: datetime) → Iterator[Dict[str, Any]][source]¶

Query meetings from primegov api endpoint.

Parameters:

begin: datetime: The timespan beginning datetime to query for events after.
end: datetime: The timespan end datetime to query for events before.

Returns:

Optional[Iterator[Meeting]]: Iterator over list of meeting JSON

See also

get_events

Notes

Because of CDP’s preference for videos, meetings without video URL are filtered out.

get_minutes_item(minutes_table: Tag) → MinutesItem | None[source]¶

Extract a minutes item from a <table> on agenda web page.

Parameters:

minutes_table: Tag: <table> tag on agenda web page for a minutes item.

Returns:

Optional[MinutesItem]: MinutesItem from given <table>

See also

get_minutes_item

get_session(meeting: Dict[str, Any]) → Session | None[source]¶

Extract a Session from a primegov meeting dictionary.

Parameters:

meeting: Meeting: Target meeting

Returns:

Optional[Session]: Session extracted from the meeting

cdp_scrapers.prime_gov_utils.get_matter(minutes_table: Tag, minutes_item: MinutesItem | None = None) → Matter | None[source]¶

Extract matter info from a minutes item <table>.

Parameters:

minutes_table: Tag: <table> for a minutes item on agenda web page
minutes_item: Optional[MinutesItem] = None: Associated minutes item that will be used to fill in some info. e.g. matter title is taken from it if available.

Returns:

Matter: A Matter instance associated with a minutes item.

See also

get_minutes_tables

Notes

Only basic string clean-up is applied, e.g. simplify whitespace. The caller is expected to clean up the data as appropriate.

cdp_scrapers.prime_gov_utils.get_minutes_item(minutes_table: Tag) → MinutesItem[source]¶

Extract minutes item name and description.

Parameters:

minutes_table: Tag: <table> for a minutes item on agenda web page

Returns:

MinutesItem: Minutes item name and description

Raises:

ValueError: If the <table> HTML structure is not as expected

See also

get_minutes_tables

cdp_scrapers.prime_gov_utils.get_minutes_tables(agenda: BeautifulSoup) → Iterator[Tag][source]¶

Return iterator over tables for minutes items.

Parameters:

agenda: Agenda: Agenda web page loaded into BeautifulSoup

Returns:

Iterator[Tag]: List of <table> for minutes items

cdp_scrapers.prime_gov_utils.get_support_files(minutes_table: Tag) → Iterator[SupportingFile][source]¶

Extract the minutes item’s support file URLs.

Parameters:

minutes_table: Tag: <table> for a minutes item on agenda web page

Returns:

Iterator[SupportingFile]: List of support file information for the input minutes item

Raises:

ValueError: If the <table> HTML structure is not as expected

See also

get_minutes_tables

cdp_scrapers.prime_gov_utils.get_support_files_div(minutes_table: Tag) → Tag[source]¶

Find the <div> containing a minutes item’s support document URLs.

Parameters:

minutes_table: Tag: <table> for a minutes item on agenda web page

Returns:

Tag: <div> with support documents for the minutes item

cdp_scrapers.prime_gov_utils.load_agenda(url: str) → BeautifulSoup | None[source]¶

Load the agenda web page.

Parameters:

url: str: Agenda web page URL

Returns:

Optional[Agenda]: Agenda web page loaded into BeautifulSoup

cdp_scrapers.prime_gov_utils.primegov_strftime(dt: datetime) → str[source]¶

strftime() in format expected for search by primegov api.

Parameters:

dt: datetime: datetime to convert

Returns:

str: Input datetime formatted as a string

See also

civic_scraper.platforms.primegov.site.PrimeGovSite.scrape

cdp_scrapers.prime_gov_utils.primegov_strptime(meeting: Dict[str, Any]) → datetime | None[source]¶

strptime() on meeting_date_time using expected format commonly used in primegov api.

Parameters:

meeting: Meeting: Target meeting

Returns:

Optional[datetime]: Meeting’s date and time

cdp_scrapers.scraper_utils module¶

class cdp_scrapers.scraper_utils.IngestionModelScraper(timezone: str, person_aliases: dict[str, set[str]] | None = None)[source]¶

Bases: object

Base class for events scrapers providing IngestionModels for cdp-backend pipeline.

Parameters:

timezone: str: The timezone for the target client, e.g. “America/Los_Angeles” or “America/New_York”. See https://en.wikipedia.org/wiki/List_of_tz_database_time_zones for canonical timezones.
person_aliases: Optional[Dict[str, Set[str]]]: Dictionary used to catch name aliases and resolve improperly different Persons to the one correct Person. Default: None

static find_time_zone() → str[source]¶: Return name for a US time zone matching UTC offset calculated from OS clock.

get_none_if_empty(model: IngestionModel) → IngestionModel | None[source]¶

Check required keys in model and return None if any such key has no value. If all required keys have valid values, return model as-is.

Parameters:

model: IngestionModel: Person, MinutesItem, etc.

Returns:

model: Optional[IngestionModel]: None or model as-is

static get_required_attrs(model: IngestionModel) → list[str][source]¶

Return list of keys required in model as specified in IngestionModel class definition.

Parameters:

model: IngestionModel: Person, MinutesItem, etc.

Returns:

attr_keys: List[str]: List of keys (attributes) in model without default value in class definition.
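The required-key check can be illustrated with a plain dataclass. This sketch assumes the ingestion models are dataclasses, treats None and empty strings or lists as "no value", and uses hypothetical names (ExampleModel, required_attrs, none_if_empty) throughout.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Any, List, Optional


@dataclass
class ExampleModel:
    # Hypothetical model: name is required, description has a default.
    name: Optional[str]
    description: Optional[str] = None


def required_attrs(model: Any) -> List[str]:
    """Names of dataclass fields declared without a default."""
    return [
        f.name
        for f in fields(model)
        if f.default is MISSING and f.default_factory is MISSING
    ]


def none_if_empty(model: Any) -> Optional[Any]:
    """Return None if any required field is None or empty, else the model."""
    for attr in required_attrs(model):
        value = getattr(model, attr)
        if value is None or (isinstance(value, (str, list)) and len(value) == 0):
            return None
    return model


complete = ExampleModel(name="Call to order")
incomplete = ExampleModel(name=None, description="No name was scraped")
```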

handle_old_new_council(old_names: list[str], new_names: list[str]) → None[source]¶

Override to handle old and new councilmember information.

Parameters:

old_names: list[str]: e.g. from scraper_utils.compare_persons
new_names: list[str]: e.g. from scraper_utils.compare_persons

Notes

Base implementation simply logs

localize_datetime(local_time: datetime) → datetime[source]¶

Return input datetime with time zone information. This allows for nonambiguous conversions to other zones including UTC.

Parameters:

local_time: datetime: The datetime to attach timezone information to.

Returns:

local_time: datetime: The date and time attributes (year, month, day, hour, …) remain unchanged. tzinfo is now provided.

resolve_person_alias(person: Person) → Person[source]¶

If input person is in fact an alias of a reference known person, return the reference person instead. Else return person as-is.

Parameters:

person: Person: Person to check for being an alias or a real unique Person

Returns:

Person: input person, or the correct reference Person if input person is an alias. This base implementation always returns person as-is.

See also

instances.seattle.person_aliases

cdp_scrapers.scraper_utils.compare_persons(scraped_persons, known_persons, primary_bodies) → PersonsComparison[source]¶

Look for old and new councilmembers.

Parameters:

scraped_persons: list[Person]: e.g. from extract_persons
known_persons: list[Person]: e.g. from ScraperStaticData
primary_bodies: list[Body]: e.g. from ScraperStaticData

Returns:

PersonsComparison: Old and new councilmember names

cdp_scrapers.scraper_utils.extract_persons(events)[source]¶

Get all sponsors and voters across all events.

Parameters:

events: list[EventIngestionModel]: Scraped events

Returns:

list[Person]: Unique list of all sponsors and voters found

cdp_scrapers.scraper_utils.parse_static_file(file_path: Path, timezone: str) → ScraperStaticData[source]¶

Parse Seats, Bodies and Persons from static data JSON.

Parameters:

file_path: Path: Path to file containing static data in JSON
timezone: str: The timezone for the target client, e.g. “America/Los_Angeles” or “America/New_York”. See https://en.wikipedia.org/wiki/List_of_tz_database_time_zones for canonical timezones.

Returns:

ScraperStaticData: Tuple[Dict[str, Seat], Dict[str, Body], Dict[str, Person]]

See also

parse_static_person
sanitize_roles

Notes

Function looks for “seats”, “primary_bodies”, “persons” top-level keys

cdp_scrapers.scraper_utils.parse_static_person(person_json: dict[str, Any], all_seats: dict[str, Seat], primary_bodies: dict[str, Body], timezone: timezone) → Person[source]¶

Parse Dict[str, Any] for a person in static data file to a Person instance. person_json[“seat”] and person_json[“roles”] are validated against all_seats and primary_bodies in static data file.

Parameters:

person_json: Dict[str, Any]: A dictionary in static data file with info for a Person.
all_seats: Dict[str, Seat]: Seats defined as top-level in static data file
primary_bodies: Dict[str, Body]: Bodies defined as top-level in static data file.
timezone: str: The timezone for the target client, e.g. “America/Los_Angeles” or “America/New_York”. See https://en.wikipedia.org/wiki/List_of_tz_database_time_zones for canonical timezones.

See also

parse_static_file
sanitize_roles

cdp_scrapers.scraper_utils.reduced_list(input_list: list[Any], collapse: bool = True) → list | None[source]¶

Remove all None items from input_list.

Parameters:

input_list: List[Any]: Input list from which to filter out items that are None
collapse: bool, default = True: If True, return None in place of an empty list

Returns:

reduced_list: Optional[List]: All items in the original list except for None values. None if all items were None and collapse is True.
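The behavior described above can be sketched in a few lines. reduce_list is a hypothetical re-implementation for illustration, not the package's code.

```python
from typing import Any, List, Optional


def reduce_list(input_list: List[Any], collapse: bool = True) -> Optional[List[Any]]:
    """Sketch: drop None items; optionally collapse an empty result to None."""
    reduced = [item for item in input_list if item is not None]
    if collapse and len(reduced) == 0:
        return None
    return reduced


mixed = reduce_list([1, None, 2])
all_none = reduce_list([None, None])
not_collapsed = reduce_list([None], collapse=False)
```

In this sketch only None is removed, so falsy values such as 0 or an empty string survive.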

cdp_scrapers.scraper_utils.sanitize_roles(person_name: str, roles: list[Role] | None = None, static_data: ScraperStaticData | None = None, council_pres_patterns: list[str] | None = None, chair_patterns: list[str] | None = None) → list[Role] | None[source]¶

Standardize roles[i].title to RoleTitle constants
Ensure only 1 councilmember Role per term.

Parameters:

person_name: str: Sanitization target Person.name
roles: Optional[List[Role]] = None: target Person’s Roles to sanitize
static_data: Optional[ScraperStaticData]: Static data defining primary council bodies and predefined Person.seat.roles. See Notes.
council_pres_patterns: List[str]: Set roles[i].title as “Council President” if match and roles[i].body is a primary body like City Council
chair_patterns: List[str]: Set roles[i].title as “Chair” if match and roles[i].body is not a primary body

Notes

Remove roles[#] if roles[#].body in static_data.primary_bodies. Use static_data.persons[#].seat.roles instead.

If roles[i].body not in static_data.primary_bodies, roles[i].title cannot be “Councilmember” or “Council President”.

Use “City Council” and “Council Briefing” if static_data.primary_bodies is empty.

cdp_scrapers.scraper_utils.str_simplified(input_str: str) → str[source]¶

Remove leading and trailing whitespaces, simplify multiple whitespaces, unify newline characters.

Parameters:

input_str: str: The string to be cleaned.

Returns:

cleaned: str: input_str stripped if it is a string
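One possible reading of this clean-up is sketched below. The exact rules are assumptions: newlines are unified to "\n", runs of spaces and tabs become one space, and repeated newlines collapse to one. simplify is a hypothetical name, and the real function may differ in these details.

```python
import re


def simplify(input_str: str) -> str:
    """Sketch of whitespace simplification under the assumptions stated above."""
    # Unify Windows and old Mac newlines to "\n".
    text = input_str.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces around newlines and collapse repeated newlines.
    text = re.sub(r" ?\n[ \n]*", "\n", text)
    return text.strip()


cleaned = simplify("  Hello \t world \r\n\r\n next  ")
```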

cdp_scrapers.types module¶

class cdp_scrapers.types.ContentURIs(video_uri, caption_uri)[source]¶

Bases: NamedTuple

Create new instance of ContentURIs(video_uri, caption_uri)

caption_uri: str | None¶: Alias for field number 1

video_uri: str | None¶: Alias for field number 0

cdp_scrapers.types.LegistarContentParser¶

Function that returns URLs for videos and captions from a Legistar/Granicus-hosted video web page

Parameters:

client: str: Which legistar client to target. Ex: “seattle”
video web page: BeautifulSoup: Video web page loaded into bs4

Returns:

uris: Optional[List[ContentURIs]]: URIs for video and optional caption

See also

cdp_scrapers.legistar_content_parsers
cdp_scrapers.legistar_utils.get_legistar_content_uris

alias of Callable[[str, BeautifulSoup], List[ContentURIs] | None]

class cdp_scrapers.types.PersonsComparison(old_names, new_names)[source]¶

Bases: NamedTuple

Create new instance of PersonsComparison(old_names, new_names)

new_names: List[str]¶: Alias for field number 1

old_names: List[str]¶: Alias for field number 0

class cdp_scrapers.types.ScraperStaticData(seats, primary_bodies, persons)[source]¶

Bases: NamedTuple

Create new instance of ScraperStaticData(seats, primary_bodies, persons)

persons: Dict[str, Person]¶: Alias for field number 2

primary_bodies: Dict[str, Body]¶: Alias for field number 1

seats: Dict[str, Seat]¶: Alias for field number 0

cdp_scrapers.youtube_utils module¶

class cdp_scrapers.youtube_utils.YoutubeIngestionScraper(channel_name: str, body_search_terms: Dict[str, str], **kwargs: Any)[source]¶

Bases: IngestionModelScraper

Base class for scraping CDP event ingestion models from YouTube videos.

Parameters:

channel_name: str: YouTube channel name where the municipality meeting videos are hosted
body_search_terms: Dict[str, str]: e.g. {“City Council”: “city council meeting”}
kwargs: Any: Passed to base class constructor

get_events(begin: datetime | None = None, end: datetime | None = None) → List[EventIngestionModel][source]¶

Scrape CDP events from the meeting videos hosted on this municipality YouTube channel.

Parameters:

begin: Optional[datetime]: The timespan beginning datetime to query for events after. Default is 2 days before UTC now
end: Optional[datetime]: The timespan end datetime to query for events before. Default is UTC now

Returns:

events: List[EventIngestionModel]: One instance of EventIngestionModel per event found on the channel

get_session(video_info: Dict[str, Any]) → Session | None[source]¶

Parse a CDP Session from YouTube video information.

Parameters:

video_info: Dict[str, Any]: YouTube video information from yt-dlp

Returns:

Optional[Session]: None if required information is missing

iter_events(begin: datetime, end: datetime) → Iterator[EventIngestionModel][source]¶

Return iterator over events from given date range, for all known bodies in this municipality.

Parameters:

begin: datetime: The timespan beginning datetime to query for events after.
end: datetime: The timespan end datetime to query for events before.

Yields:

EventIngestionModel

Notes

If multiple videos are found for a given body on the same day, they are treated as sessions of the same event.
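The same-day grouping in the note above can be sketched with plain tuples. The (body, datetime, url) tuple shape and group_sessions_by_day are hypothetical; the real scraper builds Session and EventIngestionModel objects instead.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Dict, List, Tuple


def group_sessions_by_day(
    videos: List[Tuple[str, datetime, str]],
) -> Dict[Tuple[str, date], List[str]]:
    """Sketch: videos for one body on one day become sessions of one event."""
    grouped: Dict[Tuple[str, date], List[str]] = defaultdict(list)
    # Sort by time so that sessions within an event are in chronological order.
    for body, when, url in sorted(videos, key=lambda v: v[1]):
        grouped[(body, when.date())].append(url)
    return dict(grouped)


videos = [
    ("City Council", datetime(2023, 5, 1, 14, 0), "https://example.com/afternoon"),
    ("City Council", datetime(2023, 5, 1, 9, 0), "https://example.com/morning"),
    ("City Council", datetime(2023, 5, 2, 9, 0), "https://example.com/next-day"),
]
grouped = group_sessions_by_day(videos)
```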

parse_datetime(title: str) → datetime[source]¶

Parse video datetime from title text.

Parameters:

title: str: YouTube video title

Returns:

datetime: datetime instance for the video.

Notes

Override for custom parsing. Default expects month_name day, year e.g. January 1, 1960
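A sketch of parsing the "month_name day, year" form from a title. The regular expression, the function name parse_title_datetime, and raising ValueError when no date is found are assumptions; an override for a specific channel would replace this logic with its own.

```python
import re
from datetime import datetime

_MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches e.g. "January 1, 1960" anywhere in the title.
_DATE_RE = re.compile(rf"({_MONTHS})\s+(\d{{1,2}}),\s*(\d{{4}})")


def parse_title_datetime(title: str) -> datetime:
    """Sketch: find 'month_name day, year' in the title and parse it."""
    match = _DATE_RE.search(title)
    if match is None:
        raise ValueError(f"No 'month_name day, year' date in title: {title!r}")
    month, day, year = match.groups()
    return datetime.strptime(f"{month} {day} {year}", "%B %d %Y")


parsed = parse_title_datetime("City Council Meeting - January 1, 1960")
```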

cdp_scrapers.youtube_utils.get_video_info(query_url: str) → List[Dict[str, Any]][source]¶

Return dictionaries of search hit video meta data.

Parameters:

query_url: str: Full YouTube URL including the query parameters

Returns:

List[Dict[str, Any]]: Dictionary containing information for each search hit YouTube video

cdp_scrapers.youtube_utils.urljoin_search_query(channel_name: str, search_terms: str, begin: datetime | None = None, end: datetime | None = None) → str[source]¶

Return search URL https://www.youtube.com/@channel/search?query=…

Parameters:

channel_name: str: YouTube channel hosting the videos
search_terms: str: Search terms, e.g. “city council meeting”
begin: Optional[datetime]: The timespan beginning datetime to query for events after.
end: Optional[datetime]: The timespan end datetime to query for events before.

Returns:

str: Full HTTPS URL for searching channel videos e.g. https://www.youtube.com/@channel/search?…

Raises:

ValueError

If both begin and end are None
If search term + date range is empty
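A sketch of assembling such a search URL. Only the URL shape and the ValueError for a missing date range come from the description above; build_search_url is a hypothetical name, and expressing the date range with after: and before: search operators is an assumption of this sketch, not confirmed behavior of the package.

```python
from datetime import datetime
from typing import Optional
from urllib.parse import quote_plus


def build_search_url(
    channel_name: str,
    search_terms: str,
    begin: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> str:
    """Sketch: build https://www.youtube.com/@channel/search?query=..."""
    if begin is None and end is None:
        raise ValueError("At least one of begin or end is required")
    parts = [search_terms.strip()]
    # Assumption: the date range is expressed with after:/before: operators.
    if begin is not None:
        parts.append(f"after:{begin:%Y-%m-%d}")
    if end is not None:
        parts.append(f"before:{end:%Y-%m-%d}")
    query = " ".join(part for part in parts if part)
    return f"https://www.youtube.com/@{channel_name}/search?query={quote_plus(query)}"


url = build_search_url("examplecity", "city council meeting", begin=datetime(2023, 1, 1))
```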

Module contents¶

Top-level package for cdp_scrapers.