cdp_scrapers package

Subpackages

Submodules

cdp_scrapers.legistar_content_parsers module

cdp_scrapers.legistar_utils module

class cdp_scrapers.legistar_utils.ContentUriScrapeResult(status, uris)[source]

Bases: NamedTuple

Create new instance of ContentUriScrapeResult(status, uris)

class Status(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

Status of content parsing.

ContentNotProvidedError = -3
Ok = 0
ResourceAccessError = -2
UnrecognizedPatternError = -1
status: Status

Alias for field number 0

uris: list[ContentURIs] | None

Alias for field number 1
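The intended consumption pattern, reading uris only when status is Ok, can be sketched as follows. The two classes are stand-ins written for this example so that it runs without cdp_scrapers installed; they mirror the field names and status values listed above. video_uris_or_empty is a hypothetical helper, not part of the package.

```python
from enum import IntEnum
from typing import List, NamedTuple, Optional


class ContentURIs(NamedTuple):
    # Stand-in for cdp_scrapers.types.ContentURIs
    video_uri: Optional[str]
    caption_uri: Optional[str] = None


class ContentUriScrapeResult(NamedTuple):
    # Stand-in for cdp_scrapers.legistar_utils.ContentUriScrapeResult
    class Status(IntEnum):
        ContentNotProvidedError = -3
        ResourceAccessError = -2
        UnrecognizedPatternError = -1
        Ok = 0

    status: Status
    uris: Optional[List[ContentURIs]] = None


def video_uris_or_empty(result: ContentUriScrapeResult) -> List[Optional[str]]:
    # Hypothetical helper: trust uris only when the status is Ok
    if result.status == ContentUriScrapeResult.Status.Ok and result.uris:
        return [uri.video_uri for uri in result.uris]
    return []
```

Any non-Ok status is treated here as "no usable content"; a caller could instead branch on the specific status to decide whether to retry or log.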

class cdp_scrapers.legistar_utils.LegistarScraper(client: str, timezone: str, ignore_minutes_item_patterns: list[str] | None = None, vote_approve_pattern: str = 'approve|favor|yes', vote_abstain_pattern: str = 'abstain|refuse|refrain', vote_reject_pattern: str = 'reject|oppose|no', vote_absent_pattern: str = 'absent', vote_nonvoting_pattern: str = 'nv|(?:non.*voting)', matter_adopted_pattern: str = 'approved|confirmed|passed|adopted|consent|(?:voted.*com+it+ee)', matter_in_progress_pattern: str = 'heard|read|filed|held|(?:in.*com+it+ee)', matter_rejected_pattern: str = 'rejected|dropped', minutes_item_decision_passed_pattern: str = 'pass', minutes_item_decision_failed_pattern: str = 'not|fail', static_data: ScraperStaticData | None = None, person_aliases: dict[str, set[str]] | None = None, role_replacements: dict[str, str] | None = None)[source]

Bases: IngestionModelScraper

Base class for transforming Legistar API data to CDP IngestionModel.

If the default get_events() fails and raises an error, the installation must define a derived class that implements get_content_uris().

Parameters:
client: str

Legistar client name, e.g. “seattle” for Seattle, “kingcounty” for King County.

timezone: str

The timezone for the target client, e.g. “America/Los_Angeles” or “America/New_York”. See https://en.wikipedia.org/wiki/List_of_tz_database_time_zones for canonical timezones.

ignore_minutes_item_patterns: List[str]

A list of string patterns or substrings to act as a minutes item filter. Each item in the provided list is compiled as a regex, and any minutes item that contains the compiled pattern is filtered out of the produced CDP minutes item list. Default: None (do not filter any minutes items)

vote_approve_pattern: str

Regex pattern used to convert Legistar instance’s votes in approval value to CDP constant value. Default: “approve|favor|yes”

vote_abstain_pattern: str

Regex pattern used to convert Legistar instance’s abstention value to CDP constant value. Note, this is a pure abstention, not an “approval by abstention” or “rejection by abstention” value. Those should be placed in vote_approve_pattern and vote_reject_pattern respectively. Default: “abstain|refuse|refrain”

vote_reject_pattern: str

Regex pattern used to convert Legistar instance’s votes in rejection value to CDP constant value. Default: “reject|oppose|no”

vote_absent_pattern: str

Regex pattern used to convert Legistar instance’s excused absence value to CDP constant value. Default: “absent”

vote_nonvoting_pattern: str

Regex pattern used to convert Legistar instance’s non-voting value to CDP constant value. Default: “nv|(?:non.*voting)”

matter_adopted_pattern: str

Regex pattern used to convert Legistar instance’s matter was adopted to CDP constant value. Default: “approved|confirmed|passed|adopted|consent|(?:voted.*com+it+ee)”

matter_in_progress_pattern: str

Regex pattern used to convert Legistar instance’s matter is in-progress to CDP constant value. Default: “heard|read|filed|held|(?:in.*com+it+ee)”

matter_rejected_pattern: str

Regex pattern used to convert Legistar instance’s matter was rejected to CDP constant value. Default: “rejected|dropped”

minutes_item_decision_passed_pattern: str

Regex pattern used to convert Legistar instance’s minutes item passage to CDP constant value. Default: “pass”

minutes_item_decision_failed_pattern: str

Regex pattern used to convert Legistar instance’s minutes item failure to CDP constant value. Default: “not|fail”

static_data: Optional[ScraperStaticData]

Predefined Seats, Bodies and Persons used to provide more accurate Person.seat. Default: None

person_aliases: Optional[Dict[str, Set[str]]]

Dictionary used to catch name aliases and resolve improperly distinct Persons to the one correct Person. Default: None

role_replacements: Optional[Dict[str, str]]

Dictionary used to replace role titles with CDP standard role titles. The keys should be titles you want to replace and the values should be a CDP standard role. Default: None
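As a rough illustration of how the default vote patterns above relate to Legistar vote text, the sketch below tries each pattern in turn with a case-insensitive search. The helper classify_vote, the labels it returns, the use of re.IGNORECASE, and the order in which patterns are tried are all assumptions of this example, not the documented behavior of LegistarScraper.

```python
import re
from typing import Optional

# Default patterns copied from the LegistarScraper signature.
# The non-voting pattern is tried first in this sketch because the
# reject pattern contains "no", which would also match "Non-Voting".
VOTE_PATTERNS = {
    "nonvoting": "nv|(?:non.*voting)",
    "approve": "approve|favor|yes",
    "abstain": "abstain|refuse|refrain",
    "reject": "reject|oppose|no",
    "absent": "absent",
}


def classify_vote(vote_value: str) -> Optional[str]:
    # Hypothetical helper: return the label of the first matching pattern
    for label, pattern in VOTE_PATTERNS.items():
        if re.search(pattern, vote_value, re.IGNORECASE):
            return label
    # No pattern matched; the caller decides how to handle the value
    return None
```

Because the patterns are plain alternations, short alternatives such as “no” can match inside longer words, which is worth keeping in mind when customizing them for a given municipality.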

check_for_cdp_min_ingestion(check_days: int = 7) bool[source]

Test whether at least one minimally defined EventIngestionModel can be obtained.

Parameters:
check_days: int, default=7

Test duration is the past check_days days from now

Returns:
minimum_ingestion_data_available: bool

True if got at least one minimally defined EventIngestionModel

static date_and_time_to_datetime(ev_date: str, ev_time: str | None) datetime[source]

Return datetime from ev_date and ev_time.

Parameters:
ev_date: str

Formatted as “%Y-%m-%dT%H:%M:%S”

ev_time: Optional[str]

Formatted as “%I:%M %p”, or None to leave the time from ev_date unchanged.

Returns:
datetime

date using ev_date and time using ev_time
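The two formats above can be combined with the standard library as in the following sketch. combine_date_and_time is a hypothetical standalone function written for illustration; it is not the package's implementation.

```python
from datetime import datetime
from typing import Optional


def combine_date_and_time(ev_date: str, ev_time: Optional[str]) -> datetime:
    # Parse the date portion, e.g. "2021-07-09T00:00:00"
    date_part = datetime.strptime(ev_date, "%Y-%m-%dT%H:%M:%S")
    if ev_time is None:
        # No time provided: return the parsed date unchanged
        return date_part
    # Parse the 12-hour clock time, e.g. "2:30 PM"
    time_part = datetime.strptime(ev_time, "%I:%M %p")
    return date_part.replace(hour=time_part.hour, minute=time_part.minute)
```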

filter_event_minutes(ev_minutes_item: EventMinutesItem) EventMinutesItem | None[source]

Return None if minutes_item.name contains unimportant text that we want to ignore.

Parameters:
ev_minutes_item: EventMinutesItem

The minutes item to filter.

Returns:
filtered_event_minutes_items: Optional[EventMinutesItem]

The allowed minutes item, or None if it was filtered out.
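The filtering idea can be sketched on plain strings. The patterns, the case-insensitive matching, and the helper filter_name are assumptions made for this example; the real method operates on EventMinutesItem objects and uses the patterns passed as ignore_minutes_item_patterns.

```python
import re
from typing import Optional

# Hypothetical ignore patterns, compiled once up front
IGNORE_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in ["call to order", "adjourn"]
]


def filter_name(minutes_item_name: str) -> Optional[str]:
    # Drop the item if any ignore pattern occurs anywhere in its name
    if any(pattern.search(minutes_item_name) for pattern in IGNORE_PATTERNS):
        return None
    return minutes_item_name
```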

fix_event_minutes(ev_minutes_item: EventMinutesItem | None, legistar_ev_item: dict) EventMinutesItem | None[source]

Inspect the MinutesItem and Matter in ev_minutes_item.
  • Move some fields between them to make the information more meaningful.

  • Enforce matter.result_status when appropriate.

Parameters:
ev_minutes_item: Optional[EventMinutesItem]

The specific event minutes item to clean, or None when this function runs in a loop over multiple event minutes items and the item should not be cleaned or was already filtered out.

legistar_ev_item: Dict

The original Legistar EventItem.

Returns:
cleaned_emi: Optional[EventMinutesItem]

The cleaned event minutes item. This can clean both the event minutes item and the attached matter information.

get_body(legistar_body: dict[str, Any]) Body | None[source]

Return CDP Body for Legistar body.

Parameters:
legistar_body: Dict

Legistar API body

Returns:
body: Optional[Body]

The Legistar body converted to a CDP body ingestion model. None if missing required information.

get_content_uris(legistar_ev: dict) list[ContentURIs][source]

Must implement in class derived from LegistarScraper. If Legistar Event.EventVideoPath is used, return an empty list in the override.

Parameters:
legistar_ev: Dict

Data for one Legistar Event.

Returns:
event_content_uris: List[ContentURIs]

List of ContentURIs objects for each session found.

Raises:
NotImplementedError

This base implementation does nothing
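The override contract can be sketched as below. Both classes and the ContentURIs tuple are stand-ins so the example runs without cdp_scrapers installed; in real code the derived class would inherit from LegistarScraper. The “ExampleVideoUrl” key is purely hypothetical and only represents whatever site-specific source the override reads from.

```python
from typing import List, NamedTuple, Optional


class ContentURIs(NamedTuple):
    # Stand-in for cdp_scrapers.types.ContentURIs
    video_uri: Optional[str]
    caption_uri: Optional[str] = None


class BaseScraperStandIn:
    # Stand-in for LegistarScraper: the base method is not implemented
    def get_content_uris(self, legistar_ev: dict) -> List[ContentURIs]:
        raise NotImplementedError


class ExampleCityScraper(BaseScraperStandIn):
    def get_content_uris(self, legistar_ev: dict) -> List[ContentURIs]:
        # If Legistar already provides EventVideoPath, return an empty list
        if legistar_ev.get("EventVideoPath"):
            return []
        # Hypothetical site-specific lookup of a video URL
        video_url = legistar_ev.get("ExampleVideoUrl")
        return [ContentURIs(video_uri=video_url)] if video_url else []
```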

get_event_minutes(legistar_ev_items: list[dict]) list[EventMinutesItem] | None[source]

Return List[EventMinutesItem] for Legistar API EventItems.

Parameters:
legistar_ev_items: List[Dict]

Legistar API EventItems

Returns:
event_minutes_items: Optional[List[EventMinutesItem]]

Filtered set of event minutes items.

get_event_supporting_files(legistar_ev_attachments: list[dict]) list[SupportingFile] | None[source]

Return List[SupportingFile] for Legistar API MatterAttachments.

Parameters:
legistar_ev_attachments: List[Dict]

Legistar API MatterAttachments

Returns:
files: Optional[List[SupportingFile]]

List of supporting files if provided. None if empty list or missing information.

get_events(begin: datetime | None = None, end: datetime | None = None) list[EventIngestionModel][source]

Calls get_legistar_events_for_timespan to retrieve Legistar API data and return as List[EventIngestionModel].

Parameters:
begin: datetime, optional

The timespan beginning datetime to query for events after. Default is 2 days before UTC now

end: datetime, optional

The timespan end datetime to query for events before. Default is UTC now

Returns:
events: List[EventIngestionModel]

One instance of EventIngestionModel per Legistar Event

get_matter(legistar_ev: dict) Matter | None[source]

Return Matter from Legistar API EventItem.

Parameters:
legistar_ev: Dict

Legistar API EventItem

Returns:
matter: Optional[Matter]

The Legistar matter details converted to a CDP Matter object. None if missing information.

get_matter_status(legistar_matter_status: str) str | None[source]

Return appropriate MatterStatusDecision constant from EventItemMatterStatus.

Parameters:
legistar_matter_status: str

Legistar API EventItemMatterStatus.

Returns:
matter_status: Optional[str]

A constant from CDP allowed matter status decisions. None if missing information or if matter status decision parameter patterns are not inclusive of the Legistar matter status value.

See also

cdp_backend.database.constants.MatterStatusDecision
get_minutes_item(legistar_ev_item: dict) MinutesItem | None[source]

Return MinutesItem from parts of Legistar API EventItem.

Parameters:
legistar_ev_item: Dict

Legistar API EventItem

Returns:
minutes_item: Optional[MinutesItem]

None if could not get nonempty MinutesItem.name from EventItem.

get_minutes_item_decision(legistar_item_passed_name: str) str | None[source]

Return appropriate EventMinutesItemDecision constant from EventItemPassedFlagName.

Parameters:
legistar_item_passed_name: str

Legistar API EventItemPassedFlagName

Returns:
emi_decision: Optional[str]

A constant from CDP allowed minutes item decisions. None if missing information or if minutes item decision parameter patterns are not inclusive of the Legistar minutes item decision value.

See also

cdp_backend.database.constants.EventMinutesItemDecision
get_person(legistar_person: dict) Person | None[source]

Return CDP Person for Legistar Person.

Parameters:
legistar_person: Dict

Legistar API Person

Returns:
person: Optional[Person]

The Legistar Person converted to a CDP person ingestion model. None if missing information.

get_roles(legistar_office_records: list[dict[str, Any]]) list[Role] | None[source]

Return list of CDP Role from list of legistar OfficeRecord.

Parameters:
legistar_office_records: List[Dict]

Legistar API OfficeRecords

Returns:
roles: Optional[List[Role]]

From Legistar OfficeRecords. None if missing information.

get_sponsors(legistar_sponsors: list[dict]) list[Person] | None[source]

Get legislation sponsors.

get_vote_decision(legistar_vote: dict) str | None[source]

Return appropriate VoteDecision constant based on Legistar Vote.

Parameters:
legistar_vote: Dict

Legistar API Vote

Returns:
vote_decision: Optional[str]

A constant from CDP allowed vote decisions. None if missing vote information or if vote decision parameter patterns are not inclusive of the Legistar vote value.

See also

cdp_backend.database.constants.VoteDecision
get_votes(legistar_votes: list[dict]) list[Vote] | None[source]

Return List[Vote] for Legistar API Votes.

Parameters:
legistar_votes: List[Dict]

Legistar API Votes to convert to CDP Vote ingestion models.

Returns:
votes: Optional[List[Vote]]

List of votes if any were provided. None if empty list or missing information.

inject_known_data(events: list[EventIngestionModel]) list[EventIngestionModel][source]

Augment with long-term static data that changes very infrequently, e.g. self.static_data, which includes Person.picture_uri and Person.seat.

Parameters:
events:

Returned events from get_events()

Returns:
events: List[EventIngestionModel]

Input events with static information possibly injected

inject_known_person(person: Person) Person[source]

Inject information if person exists in static_data.persons.

Parameters:
person: Person

Person into which to inject data from static_data

Returns:
Person

Input person updated with information from static_data, and seat.roles sanitized.

See also

scraper_utils.sanitize_roles
property is_legistar_compatible: bool

Check that Legistar API recognizes client name.

Returns:
compatible: bool

True if client_name is a valid Legistar client name

post_process_ingestion_models(events: list[EventIngestionModel]) list[EventIngestionModel][source]

Called at the end of get_events() for fully custom site-specific processing. inject_known_data() has already operated on the input events.

Parameters:
events:

Returned events from get_events()

Returns:
events: List[EventIngestionModel]

Base implementation simply returns input events as-is

resolve_person_alias(person: Person) Person | None[source]

If input person is in fact an alias of a reference known person, return the reference person instead. Else return person as-is.

Parameters:
person: Person

Person to check; it may be an alias or a real, unique Person

Returns:
Person

input person, or the correct reference Person if input person is an alias.

See also

instances.seattle.person_aliases
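The alias lookup can be sketched on plain name strings. This example assumes each key of the alias dictionary is the reference name and each value is the set of its known aliases; that reading of person_aliases, the sample names, and the helper resolve_name are assumptions of this sketch.

```python
from typing import Dict, Set

# Hypothetical alias table: reference name -> set of alias names
PERSON_ALIASES: Dict[str, Set[str]] = {
    "Jane Q. Example": {"Jane Example", "J. Example"},
}


def resolve_name(name: str, aliases: Dict[str, Set[str]]) -> str:
    # Return the reference name if the input is a known alias
    for reference_name, alias_names in aliases.items():
        if name in alias_names:
            return reference_name
    # Otherwise the input is treated as a real, unique name
    return name
```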
use_or_replace_role(role_title: str) str[source]

Lookup if the provided role title should be replaced with a CDP standard value. If the provided role title should be replaced, then return the proper replacement title, otherwise if the title wasn’t found in the role replacement lookup table, return the provided role_title unchanged.

Parameters:
role_title: str

The role title to check and potentially replace with a CDP standard.

Returns:
role_title: str

The original role title if no replacement was found in the role replacements lookup-table, or the CDP standard title swapped from the lookup-table.
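The lookup described above amounts to a dictionary get with the input as the fallback. The standalone function and the sample mapping below are illustrative assumptions, not the package's code.

```python
from typing import Dict, Optional


def use_or_replace(
    role_title: str, role_replacements: Optional[Dict[str, str]] = None
) -> str:
    # Without a lookup table there is nothing to replace
    if not role_replacements:
        return role_title
    # Fall back to the original title when no replacement is defined
    return role_replacements.get(role_title, role_title)
```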

cdp_scrapers.legistar_utils.get_legistar_body(client: str, body_id: int, use_cache: bool = False) dict[str, Any] | None[source]

Return information for a single legistar body in JSON.

Parameters:
client: str

Which legistar client to target. Ex: “seattle”

body_id: int

Unique ID for this body in the legistar municipality

use_cache: bool

True: Store result to prevent querying repeatedly for same body_id

Returns:
body: Dict[str, Any]

legistar API body

Notes

known_legistar_bodies cache is cleared for every LegistarScraper.get_events() call

cdp_scrapers.legistar_utils.get_legistar_content_uris(client: str, legistar_ev: dict) ContentUriScrapeResult[source]

Return URLs for videos and captions from a Legistar/Granicus-hosted video web page.

Parameters:
client: str

Which legistar client to target. Ex: “seattle”

legistar_ev: Dict

Data for one Legistar Event.

Returns:
ContentUriScrapeResult
status: ContentUriScrapeResult.Status

Status code describing the scraping process. Use uris only if status is Ok

uris: Optional[List[ContentURIs]]

URIs for video and optional caption

Raises:
NotImplementedError

Means the content structure of the web page hosting the session video has changed. The scraping code needs explicit review and an update.

ConnectionError

When the Legistar site (e.g. *.legistar.com) itself may be down.

cdp_scrapers.legistar_utils.get_legistar_events_for_timespan(client: str, begin: datetime | None = None, end: datetime | None = None) list[dict][source]

Get all legistar events and each event’s minutes items, people, and votes, for a client for a given timespan.

Parameters:
client: str

Which legistar client to target. Ex: “seattle”

begin: Optional[datetime]

The timespan beginning datetime to query for events after. Default: UTC now - 1 day

end: Optional[datetime]

The timespan end datetime to query for events before. Default: UTC now

Returns:
events: List[Dict]

All legistar events that occur between the datetimes provided for the client provided. Additionally, requests and attaches agenda items, minutes items, any attachments, called “EventItems”, requests votes for any of these “EventItems”, and requests person information for any vote.

cdp_scrapers.legistar_utils.get_legistar_person(client: str, person_id: int, use_cache: bool = False) dict[str, Any] | None[source]

Return information for a single legistar person in JSON.

Parameters:
client: str

Which legistar client to target. Ex: “seattle”

person_id: int

Unique ID for this person in the legistar municipality

use_cache: bool

True: Store result to prevent querying repeatedly for same person_id

Returns:
person: Dict[str, Any]

legistar API person

Notes

known_legistar_persons cache is cleared for every LegistarScraper.get_events() call

cdp_scrapers.legistar_utils.parse_video_page_url(video_page_url: str, client: str) list[ContentURIs][source]

Return URLs for videos and captions from a Legistar/Granicus-hosted video web page.

Parameters:
video_page_url: str

The URL for the page of the legistar video

client: str

Which legistar client to target. Ex: “seattle”

Returns:
uris: Optional[List[ContentURIs]]

URIs for video and optional caption

cdp_scrapers.prime_gov_utils module

class cdp_scrapers.prime_gov_utils.PrimeGovScraper(client_id: str, timezone: str, matter_adopted_pattern: str = 'approved|confirmed|passed|adopted|consent|(?:voted.*com+it+ee)', matter_in_progress_pattern: str = 'heard|read|filed|held|(?:in.*com+it+ee)', matter_rejected_pattern: str = 'rejected|dropped', person_aliases: Dict[str, Set[str]] | None = None)[source]

Bases: PrimeGovSite, IngestionModelScraper

Adapter for civic_scraper PrimeGovSite in cdp-scrapers.

See also

civic_scraper.platforms.primegov.site.PrimeGovSite
cdp_scrapers.scraper_utils.IngestionModelScraper
Parameters:
client_id: str

primegov api instance id, e.g. lacity for Los Angeles, CA

timezone: str

Local time zone

matter_adopted_pattern: str

Regex pattern used to convert matter was adopted to CDP constant value. Default: “approved|confirmed|passed|adopted|consent|(?:voted.*com+it+ee)”

matter_in_progress_pattern: str

Regex pattern used to convert matter is in-progress to CDP constant value. Default: “heard|read|filed|held|(?:in.*com+it+ee)”

matter_rejected_pattern: str

Regex pattern used to convert matter was rejected to CDP constant value. Default: “rejected|dropped”

person_aliases: Optional[Dict[str, Set[str]]] = None

Dictionary used to catch name aliases and resolve improperly different Persons to the one correct Person.

get_body(meeting: Dict[str, Any]) Body | None[source]

Extract a Body from a primegov meeting dictionary.

Parameters:
meeting: Meeting

Target meeting

Returns:
Optional[Body]

Body extracted from the meeting

get_event(meeting: Dict[str, Any]) EventIngestionModel | None[source]

Extract an EventIngestionModel from a primegov meeting dictionary.

Parameters:
meeting: Meeting

Target meeting

Returns:
Optional[EventIngestionModel]

EventIngestionModel extracted from the meeting

get_event_minutes_item(minutes_table: Tag) EventMinutesItem | None[source]

Extract event minutes item info from a minutes item <table> on agenda web page.

Parameters:
minutes_table: Tag

<table> tag on agenda web page for a minutes item.

Returns:
EventMinutesItem

Container object with matter, minutes item

get_event_minutes_items(meeting: Dict[str, Any]) List[EventMinutesItem] | None[source]

First find a web page for the given meeting’s agenda. Then scrape minutes items.

Parameters:
meeting: Meeting

Target meeting

Returns:
Optional[List[EventMinutesItem]]

Event minutes items scraped from the meeting agenda web page.

get_events(begin: datetime | None = None, end: datetime | None = None) List[EventIngestionModel][source]

Return list of ingested events for the given time period.

Parameters:
begin: Optional[datetime]

The timespan beginning datetime to query for events after. Default is 2 days before UTC now

end: Optional[datetime]

The timespan end datetime to query for events before. Default is UTC now

Returns:
events: List[EventIngestionModel]

One instance of EventIngestionModel per primegov api meeting

See also

get_meetings
get_matter(minutes_table: Tag, minutes_item: MinutesItem | None = None) Matter | None[source]

Extract matter info from a minutes item <table> on agenda web page.

Parameters:
minutes_table: Tag

<table> tag on agenda web page for a minutes item.

minutes_item: Optional[MinutesItem] = None

Associated minutes item that will be used to fill in some info.

Returns:
Matter

A Matter instance associated with a minutes item.

See also

matter_status_pattern_map
get_matter

Notes

self.matter_status_pattern_map is used to standardize result_status to one of the CDP ingestion model constants.

get_meetings(begin: datetime, end: datetime) Iterator[Dict[str, Any]][source]

Query meetings from primegov api endpoint.

Parameters:
begin: datetime

The timespan beginning datetime to query for events after.

end: datetime

The timespan end datetime to query for events before.

Returns:
Optional[Iterator[Meeting]]

Iterator over list of meeting JSON

See also

get_events

Notes

Because of CDP’s preference for videos, meetings without video URL are filtered out.

get_minutes_item(minutes_table: Tag) MinutesItem | None[source]

Extract a minutes item from a <table> on agenda web page.

Parameters:
minutes_table: Tag

<table> tag on agenda web page for a minutes item.

Returns:
Optional[MinutesItem]

MinutesItem from given <table>

See also

get_minutes_item
get_session(meeting: Dict[str, Any]) Session | None[source]

Extract a Session from a primegov meeting dictionary.

Parameters:
meeting: Meeting

Target meeting

Returns:
Optional[Session]

Session extracted from the meeting

cdp_scrapers.prime_gov_utils.get_matter(minutes_table: Tag, minutes_item: MinutesItem | None = None) Matter | None[source]

Extract matter info from a minutes item <table>.

Parameters:
minutes_table: Tag

<table> for a minutes item on agenda web page

minutes_item: Optional[MinutesItem] = None

Associated minutes item that will be used to fill in some info. e.g. matter title is taken from it if available.

Returns:
Matter

A Matter instance associated with a minutes item.

Notes

Only basic string clean-up is applied, e.g. simplify whitespace. The caller is expected to clean up the data as appropriate.

cdp_scrapers.prime_gov_utils.get_minutes_item(minutes_table: Tag) MinutesItem[source]

Extract minutes item name and description.

Parameters:
minutes_table: Tag

<table> for a minutes item on agenda web page

Returns:
MinutesItem

Minutes item name and description

Raises:
ValueError

If the <table> HTML structure is not as expected

cdp_scrapers.prime_gov_utils.get_minutes_tables(agenda: BeautifulSoup) Iterator[Tag][source]

Return iterator over tables for minutes items.

Parameters:
agenda: Agenda

Agenda web page loaded into BeautifulSoup

Returns:
Iterator[Tag]

List of <table> for minutes items

cdp_scrapers.prime_gov_utils.get_support_files(minutes_table: Tag) Iterator[SupportingFile][source]

Extract the minutes item’s support file URLs.

Parameters:
minutes_table: Tag

<table> for a minutes item on agenda web page

Returns:
Iterator[SupportingFile]

List of support file information for the input minutes item

Raises:
ValueError

If the <table> HTML structure is not as expected

cdp_scrapers.prime_gov_utils.get_support_files_div(minutes_table: Tag) Tag[source]

Find the <div> containing a minutes item’s support document URLs.

Parameters:
minutes_table: Tag

<table> for a minutes item on agenda web page

Returns:
Tag

<div> with support documents for the minutes item

cdp_scrapers.prime_gov_utils.load_agenda(url: str) BeautifulSoup | None[source]

Load the agenda web page.

Parameters:
url: str

Agenda web page URL

Returns:
Optional[Agenda]

Agenda web page loaded into BeautifulSoup

cdp_scrapers.prime_gov_utils.primegov_strftime(dt: datetime) str[source]

strftime() in format expected for search by primegov api.

Parameters:
dt: datetime

datetime to convert

Returns:
str

Input datetime as a string

See also

civic_scraper.platforms.primegov.site.PrimeGovSite.scrape
cdp_scrapers.prime_gov_utils.primegov_strptime(meeting: Dict[str, Any]) datetime | None[source]

strptime() on meeting_date_time using expected format commonly used in primegov api.

Parameters:
meeting: Meeting

Target meeting

Returns:
Optional[datetime]

Meeting’s date and time

cdp_scrapers.scraper_utils module

class cdp_scrapers.scraper_utils.IngestionModelScraper(timezone: str, person_aliases: dict[str, set[str]] | None = None)[source]

Bases: object

Base class for events scrapers providing IngestionModels for cdp-backend pipeline.

Parameters:
timezone: str

The timezone for the target client, e.g. “America/Los_Angeles” or “America/New_York”. See https://en.wikipedia.org/wiki/List_of_tz_database_time_zones for canonical timezones.

person_aliases: Optional[Dict[str, Set[str]]]

Dictionary used to catch name aliases and resolve improperly different Persons to the one correct Person. Default: None

static find_time_zone() str[source]

Return name for a US time zone matching UTC offset calculated from OS clock.

get_none_if_empty(model: IngestionModel) IngestionModel | None[source]

Check the required keys in model and return None if any such key has no value. If all required keys have valid values, return the model as-is.

Parameters:
model: IngestionModel

Person, MinutesItem, etc.

Returns:
model: Optional[IngestionModel]

None or model as-is

static get_required_attrs(model: IngestionModel) list[str][source]

Return list of keys required in model as specified in IngestionModel class definition.

Parameters:
model: IngestionModel

Person, MinutesItem, etc.

Returns:
attr_keys: List[str]

List of keys (attributes) in model without default value in class definition.

handle_old_new_council(old_names: list[str], new_names: list[str]) None[source]

Override to handle old and new councilmember information.

Parameters:
old_names: list[str]

e.g. from scraper_utils.compare_persons

new_names: list[str]

e.g. from scraper_utils.compare_persons

Notes

Base implementation simply logs

localize_datetime(local_time: datetime) datetime[source]

Return input datetime with time zone information. This allows for nonambiguous conversions to other zones including UTC.

Parameters:
local_time: datetime

The datetime to attach timezone information to.

Returns:
local_time: datetime

The date and time attributes (year, month, day, hour, …) remain unchanged. tzinfo is now provided.

resolve_person_alias(person: Person) Person[source]

If input person is in fact an alias of a reference known person, return the reference person instead. Else return person as-is.

Parameters:
person: Person

Person to check; it may be an alias or a real, unique Person

Returns:
Person

input person, or the correct reference Person if input person is an alias. This base implementation always returns person as-is.

See also

instances.seattle.person_aliases
cdp_scrapers.scraper_utils.compare_persons(scraped_persons, known_persons, primary_bodies) PersonsComparison[source]

Look for old and new councilmembers.

Parameters:
scraped_persons: list[Person]

e.g. from extract_persons

known_persons: list[Person]

e.g. from ScraperStaticData

primary_bodies: list[Body]

e.g. from ScraperStaticData

Returns:
PersonsComparison

Old and new councilmember names

cdp_scrapers.scraper_utils.extract_persons(events)[source]

Get all sponsors and voters across all events.

Parameters:
events: list[EventIngestionModel]

Scraped events

Returns:
list[Person]

Unique list of all sponsors and voters found

cdp_scrapers.scraper_utils.parse_static_file(file_path: Path, timezone: str) ScraperStaticData[source]

Parse Seats, Bodies and Persons from static data JSON.

Parameters:
file_path: Path

Path to file containing static data in JSON

timezone: str

The timezone for the target client, e.g. “America/Los_Angeles” or “America/New_York”. See https://en.wikipedia.org/wiki/List_of_tz_database_time_zones for canonical timezones.

Returns:
ScraperStaticData:

Tuple[Dict[str, Seat], Dict[str, Body], Dict[str, Person]]

Notes

Function looks for “seats”, “primary_bodies”, “persons” top-level keys

cdp_scrapers.scraper_utils.parse_static_person(person_json: dict[str, Any], all_seats: dict[str, Seat], primary_bodies: dict[str, Body], timezone: timezone) Person[source]

Parse Dict[str, Any] for a person in static data file to a Person instance. person_json[“seat”] and person_json[“roles”] are validated against all_seats and primary_bodies in static data file.

Parameters:
person_json: Dict[str, Any]

A dictionary in static data file with info for a Person.

all_seats: Dict[str, Seat]

Seats defined as top-level in static data file

primary_bodies: Dict[str, Body]

Bodies defined as top-level in static data file.

timezone: str

The timezone for the target client, e.g. “America/Los_Angeles” or “America/New_York”. See https://en.wikipedia.org/wiki/List_of_tz_database_time_zones for canonical timezones.

cdp_scrapers.scraper_utils.reduced_list(input_list: list[Any], collapse: bool = True) list | None[source]

Remove all None items from input_list.

Parameters:
input_list: List[Any]

Input list from which to filter out items that are None

collapse: bool, default = True

If True, return None in place of an empty list

Returns:
reduced_list: Optional[List]

All items in the original list except for None values. None if all items were None and collapse is True.
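The documented behavior can be sketched in a few lines. The function name reduced is chosen here to mark it as an illustration rather than the package's own code.

```python
from typing import Any, List, Optional


def reduced(input_list: List[Any], collapse: bool = True) -> Optional[List[Any]]:
    # Keep everything that is not None (falsy values like 0 or "" stay)
    kept = [item for item in input_list if item is not None]
    if collapse and not kept:
        # Nothing left: signal "no data" with None instead of []
        return None
    return kept
```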

cdp_scrapers.scraper_utils.sanitize_roles(person_name: str, roles: list[Role] | None = None, static_data: ScraperStaticData | None = None, council_pres_patterns: list[str] | None = None, chair_patterns: list[str] | None = None) list[Role] | None[source]
  1. Standardize roles[i].title to RoleTitle constants

  2. Ensure only 1 councilmember Role per term.

Parameters:
person_name: str

Sanitization target Person.name

roles: Optional[List[Role]] = None

target Person’s Roles to sanitize

static_data: Optional[ScraperStaticData]

Static data defining primary council bodies and predefined Person.seat.roles. See Notes.

council_pres_patterns: List[str]

Set roles[i].title as “Council President” if match and roles[i].body is a primary body like City Council

chair_patterns: List[str]

Set roles[i].title as “Chair” if match and roles[i].body is not a primary body

Notes

Remove roles[#] if roles[#].body in static_data.primary_bodies. Use static_data.persons[#].seat.roles instead.

If roles[i].body not in static_data.primary_bodies, roles[i].title cannot be “Councilmember” or “Council President”.

Use “City Council” and “Council Briefing” if static_data.primary_bodies is empty.

cdp_scrapers.scraper_utils.str_simplified(input_str: str) str[source]

Remove leading and trailing whitespaces, simplify multiple whitespaces, unify newline characters.

Parameters:
input_str: str

The string to be cleaned.

Returns:
cleaned: str

input_str stripped if it is a string
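One way to realize the clean-up described above is sketched below with regular expressions. The exact rules (which characters count as whitespace, whether repeated newlines are merged) are assumptions of this example and may differ from what str_simplified actually does.

```python
import re
from typing import Any


def simplify(input_str: Any) -> Any:
    # Non-string input is returned untouched
    if not isinstance(input_str, str):
        return input_str
    # Unify Windows and old Mac newlines to "\n"
    text = re.sub(r"\r\n?", "\n", input_str)
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t\f\v]+", " ", text)
    # Drop spaces around newlines, then merge repeated newlines
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n+", "\n", text)
    # Remove leading and trailing whitespace
    return text.strip()
```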

cdp_scrapers.types module

class cdp_scrapers.types.ContentURIs(video_uri, caption_uri)[source]

Bases: NamedTuple

Create new instance of ContentURIs(video_uri, caption_uri)

caption_uri: str | None

Alias for field number 1

video_uri: str | None

Alias for field number 0

cdp_scrapers.types.LegistarContentParser

Function that returns URLs for videos and captions from a Legistar/Granicus-hosted video web page

Parameters:
client: str

Which legistar client to target. Ex: “seattle”

video web page: BeautifulSoup

Video web page loaded into bs4

Returns:
uris: Optional[List[ContentURIs]]

URIs for video and optional caption

alias of Callable[[str, BeautifulSoup], List[ContentURIs] | None]

class cdp_scrapers.types.PersonsComparison(old_names, new_names)[source]

Bases: NamedTuple

Create new instance of PersonsComparison(old_names, new_names)

new_names: List[str]

Alias for field number 1

old_names: List[str]

Alias for field number 0

class cdp_scrapers.types.ScraperStaticData(seats, primary_bodies, persons)[source]

Bases: NamedTuple

Create new instance of ScraperStaticData(seats, primary_bodies, persons)

persons: Dict[str, Person]

Alias for field number 2

primary_bodies: Dict[str, Body]

Alias for field number 1

seats: Dict[str, Seat]

Alias for field number 0

cdp_scrapers.youtube_utils module

class cdp_scrapers.youtube_utils.YoutubeIngestionScraper(channel_name: str, body_search_terms: Dict[str, str], **kwargs: Any)[source]

Bases: IngestionModelScraper

Base class for scraping CDP event ingestion models from YouTube videos.

Parameters:
channel_name: str

YouTube channel name where the municipality meeting videos are hosted

body_search_terms: Dict[str, str]

e.g. {“City Council”: “city council meeting”}

kwargs: Any

Passed to base class constructor

get_events(begin: datetime | None = None, end: datetime | None = None) List[EventIngestionModel][source]

Scrape CDP events from the meeting videos hosted on this municipality YouTube channel.

Parameters:
begin: Optional[datetime]

The timespan beginning datetime to query for events after. Default is 2 days before UTC now

end: Optional[datetime]

The timespan end datetime to query for events before. Default is UTC now

Returns:
events: List[EventIngestionModel]

One instance of EventIngestionModel per event found on the YouTube channel

get_session(video_info: Dict[str, Any]) Session | None[source]

Parse a CDP Session from YouTube video information.

Parameters:
video_info: Dict[str, Any]

YouTube video information from yt-dlp

Returns:
Optional[Session]

None if required information is missing

iter_events(begin: datetime, end: datetime) Iterator[EventIngestionModel][source]

Return iterator over events from given date range, for all known bodies in this municipality.

Parameters:
begin: datetime

The timespan beginning datetime to query for events after.

end: datetime

The timespan end datetime to query for events before.

Yields:
EventIngestionModel

Notes

If multiple videos are found for a given body on the same day, they are treated as sessions of the same event.
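The grouping rule in the note above can be sketched as follows. The tuple layout (body name, datetime, URL) and the function group_sessions are hypothetical; the real method yields EventIngestionModel objects rather than a dictionary.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Dict, Iterable, List, Tuple


def group_sessions(
    videos: Iterable[Tuple[str, datetime, str]]
) -> Dict[Tuple[str, date], List[str]]:
    # Videos sharing a body and a calendar day belong to one event
    grouped: Dict[Tuple[str, date], List[str]] = defaultdict(list)
    for body_name, video_datetime, video_url in videos:
        grouped[(body_name, video_datetime.date())].append(video_url)
    return dict(grouped)
```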

parse_datetime(title: str) datetime[source]

Parse video datetime from title text.

Parameters:
title: str

YouTube video title

Returns:
datetime

datetime instance for the video.

Notes

Override for custom parsing. Default expects month_name day, year e.g. January 1, 1960
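A parser for the default “month_name day, year” form might look like the sketch below. The regular expression and the function parse_title_datetime are assumptions for illustration; an override of parse_datetime for a specific channel would adapt the pattern to that channel's titles.

```python
import re
from datetime import datetime


def parse_title_datetime(title: str) -> datetime:
    # Look for text such as "January 1, 1960" anywhere in the title
    match = re.search(r"[A-Za-z]+ \d{1,2}, \d{4}", title)
    if match is None:
        raise ValueError(f"No date found in title: {title!r}")
    # %B is the full month name, which depends on the current locale
    return datetime.strptime(match.group(0), "%B %d, %Y")
```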

cdp_scrapers.youtube_utils.get_video_info(query_url: str) List[Dict[str, Any]][source]

Return dictionaries of search hit video meta data.

Parameters:
query_url: str

Full YouTube URL including the query parameters

Returns:
List[Dict[str, Any]]

Dictionary containing information for each search hit YouTube video

cdp_scrapers.youtube_utils.urljoin_search_query(channel_name: str, search_terms: str, begin: datetime | None = None, end: datetime | None = None) str[source]

Return search URL https://www.youtube.com/@channel/search?query=

Parameters:
channel_name: str

YouTube channel hosting the videos

search_terms: str

Search terms, e.g. “city council meeting”

begin: Optional[datetime]

The timespan beginning datetime to query for events after.

end: Optional[datetime]

The timespan end datetime to query for events before.

Returns:
str

Full HTTPS URL for searching channel videos e.g. https://www.youtube.com/@channel/search?…

Raises:
ValueError
  • If both begin and end are None

  • If search term + date range is empty
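A URL of the shape shown above could be assembled as in this sketch. Encoding the date range with “after:” and “before:” search operators is an assumption of this example, as is the function name build_search_url; the source only states the base URL form and the ValueError conditions.

```python
from datetime import datetime
from typing import Optional
from urllib.parse import quote_plus


def build_search_url(
    channel_name: str,
    search_terms: str,
    begin: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> str:
    if begin is None and end is None:
        raise ValueError("begin and end cannot both be None")
    parts = [search_terms.strip()]
    # Assumed date operators; the real query format may differ
    if begin is not None:
        parts.append(f"after:{begin:%Y-%m-%d}")
    if end is not None:
        parts.append(f"before:{end:%Y-%m-%d}")
    query = " ".join(part for part in parts if part)
    # URL-encode the query so spaces and colons are safe
    return f"https://www.youtube.com/@{channel_name}/search?query={quote_plus(query)}"
```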

Module contents

Top-level package for cdp_scrapers.