cdp_scrapers.instances package¶
Submodules¶
cdp_scrapers.instances.atlanta module¶
- cdp_scrapers.instances.atlanta.assign_constant(driver: WebDriver, i: int, j: int, vote_decision: str, voting_list: list, body_name: str, persons: dict)[source]¶
Assign constants and add Vote to the ingestion models based on the vote decision.
- Parameters:
- driver: WebDriver
webdriver of the matter page
- i: int
tr[i] is the matter we are looking at
- j: int
the row number of the information in a matter that we are looking at
- vote_decision: str
the vote decision constant of the vote decision
- voting_list: list
the list that contains vote ingestion models
- body_name: str
the body name of the current meeting
- persons: dict
Dict[str, ingestion_models.Person]
- cdp_scrapers.instances.atlanta.convert_status_constant(decision: str) str [source]¶
Converts the matter result status to the existing constants.
- Parameters:
- decision: str
decision of the matter
- Returns:
- db_constants
result status constants
- cdp_scrapers.instances.atlanta.get_date(driver: WebDriver, url: str, from_dt: datetime, to_dt: datetime) list [source]¶
Get a list of ingestion models for the meetings held during the selected time range.
- Parameters:
- driver: WebDriver
empty webdriver
- url: str
the url of the calendar page
- from_dt:
the start date
- to_dt:
the end date
- Returns:
- list
all the ingestion models for the selected date range
- cdp_scrapers.instances.atlanta.get_events(from_dt: datetime, to_dt: datetime) list [source]¶
Gets the right calendar link and feeds it to the function that gets a list of ingestion models (see the usage sketch after this entry).
- Parameters:
- from_dt:
the start date
- to_dt:
the end date
- Returns:
- list
all the ingestion models for the selected date range
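A minimal usage sketch using only the documented signature; the date range is illustrative, and running it requires the Selenium webdriver setup this module relies on:

    from datetime import datetime

    from cdp_scrapers.instances.atlanta import get_events

    # Date range is illustrative; any from_dt/to_dt pair works
    events = get_events(from_dt=datetime(2023, 1, 1), to_dt=datetime(2023, 1, 8))
    print(f"Scraped {len(events)} Atlanta events")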
- cdp_scrapers.instances.atlanta.get_matter_status(driver: WebDriver, i: int) Tuple[list, str] [source]¶
Find the matter result status.
- Parameters:
- driver: WebDriver
webdriver of the matter page
- i: int
tracker used to loop the rows in the matter page
- Returns:
- sub_sections: element
the block under the matter for the current date
- decision_constant: element
the matter decision constant
- cdp_scrapers.instances.atlanta.get_new_person(name: str) Person [source]¶
Creates the person ingestion model for the people that are not recorded.
- Parameters:
- name: str
the name of the person
- Returns:
- ingestion model
the person ingestion model for the newly appeared person
- cdp_scrapers.instances.atlanta.get_person() dict [source]¶
Put the information gathered by get_single_person() into a dictionary.
- Returns:
- dictionary
key: person’s name; value: person’s ingestion model
- cdp_scrapers.instances.atlanta.get_single_person(driver: WebDriver, member_name: str) Person [source]¶
Get all the information for one person. Includes: role, seat, picture, phone, email.
- Parameters:
- driver:
webdriver calling the people’s dictionary page
- member_name:
person’s name
- Returns:
- ingestion_models
the ingestion model for the person
- cdp_scrapers.instances.atlanta.get_voting_result(driver: WebDriver, sub_sections_len: int, i: int, body_name: str, persons: dict) list [source]¶
Scrapes and converts the voting decisions to the existing constants.
- Parameters:
- driver: WebDriver
webdriver of the matter page
- sub_sections_len: int
the row number in the block under the matter for the current date
- i: int
tr[i] is the matter we are looking at
- body_name: str
the body name of the current meeting
- persons: dict
Dict[str, ingestion_models.Person]
- Returns:
- list
contains the Vote ingestion model for each person
- cdp_scrapers.instances.atlanta.get_year(driver: WebDriver, url: str, from_dt: datetime) str [source]¶
Navigate to the year that we are looking for.
- Parameters:
- driver: WebDriver
empty webdriver
- url: str
the url of the calendar page
- from_dt: datetime
the datetime object for the search target year
- Returns:
- link:str
the link to the calendar for the year that we are looking for
- cdp_scrapers.instances.atlanta.parse_event(url: str) EventIngestionModel [source]¶
Scrapes all the information for a meeting.
- Parameters:
- url: str
the url of the meeting that we want to scrape
- Returns:
- ingestion model
the ingestion model for the meeting
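A minimal usage sketch; the meeting URL below is a placeholder, not a real Atlanta meeting page:

    from cdp_scrapers.instances.atlanta import parse_event

    # Placeholder URL; substitute a real Atlanta meeting detail page
    meeting_url = "https://example.com/atlanta-meeting-detail"
    event = parse_event(meeting_url)
    print(type(event).__name__)  # EventIngestionModel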
- cdp_scrapers.instances.atlanta.parse_single_matter(driver: WebDriver, test: str, item: str, body_name: str, s_word_formated: datetime, persons: dict) EventMinutesItem [source]¶
Get the minutes item that contains a matter.
- Parameters:
- driver: WebDriver
webdriver of the matter page
- matter: element
the matter we are looking at
- body_name: str
the body name of the current meeting
- s_word_formated: datetime
the date of the current meeting
- persons: dict
Dict[str, ingestion_models.Person]
- Returns:
- ingestion model
minutes ingestion model with the matters information
cdp_scrapers.instances.empty module¶
- cdp_scrapers.instances.empty.get_events(from_dt: datetime, to_dt: datetime, **kwargs: Any) List[EventIngestionModel] [source]¶
Get all events for the provided timespan.
- Parameters:
- from_dt: datetime
Datetime to start event gather from.
- to_dt: datetime
Datetime to end event gather at.
- kwargs: Any
Any keyword arguments to provide to downstream functions.
- Returns:
- events: List[EventIngestionModel]
All events gathered that occurred in the provided time range.
Notes
As the implementer of the get_events function, you can choose to ignore the from_dt and to_dt parameters. However, they are useful for manually kicking off pipelines from the GitHub Actions UI.
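For illustration, a minimal implementation sketch under these notes; it assumes the cdp_backend ingestion model import path used elsewhere by these scrapers, and the body simply returns no events:

    from datetime import datetime
    from typing import Any, List

    from cdp_backend.pipeline.ingestion_models import EventIngestionModel


    def get_events(
        from_dt: datetime, to_dt: datetime, **kwargs: Any
    ) -> List[EventIngestionModel]:
        # A not-yet-implemented scraper may return no events;
        # from_dt and to_dt can be ignored, as noted above.
        return []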
cdp_scrapers.instances.houston module¶
- class cdp_scrapers.instances.houston.AgendaType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
IntEnum
- Pdf = 2¶
- WebPage = 1¶
- class cdp_scrapers.instances.houston.HoustonScraper[source]¶
Bases:
IngestionModelScraper
- get_agenda(element: Tag) Tag | NavigableString | None [source]¶
Get event agenda for a specific details page.
- Parameters:
- element: Tag
The element from which we want to get agenda
- Returns:
- AgendaType, Tag
Resource type for the agenda and the agenda resource itself
- get_all_elements_in_range(time_from: datetime, time_to: datetime) Dict[str, Tag] [source]¶
Get all the meetings in a range of dates.
- Parameters:
- time_from: datetime
Earliest meeting date to look at
- time_to: datetime
Latest meeting date to look at
- Returns:
- Dict[str, Tag]
Dictionary mapping the date of the meeting to the element for the meeting on that date
- get_body_name(event: Tag | NavigableString | None) str [source]¶
Get the body name for an event.
- Parameters:
- event: Union[Tag, NavigableString, None]
All elements in the page that we want to scrape
- Returns:
- str
The body name
- get_date_mainlink(element: Tag) str [source]¶
Find the main link for one event.
- Parameters:
- element: Tag
The element of one event
- Returns:
- str
The main link for this event
- get_diff_yearid(event_date: datetime) str [source]¶
Get the id of the year tab where the event is stored. Events for different years are stored in different tabs, so this makes it possible to gather events across years.
- Parameters:
- event_date: datetime
The date of the event we are trying to parse
- Returns:
- str
The year id that can locate the year tab where the event is stored
- get_event(date: str, element: Tag) EventIngestionModel [source]¶
Parse one event at a specific date, returning the city council meeting information for that date.
- Parameters:
- date: str
the date of this meeting
- element: Tag
the meeting Tag element
- Returns:
- ingestion_models.EventIngestionModel
EventIngestionModel for one meeting date
- get_event_minutes_item(event: Tag | NavigableString | None) List[EventMinutesItem] [source]¶
Parse the page and gather the event minute items.
- Parameters:
- event: Union[Tag, NavigableString, None]
All elements in the page that we want to scrape
- Returns:
- List[ingestion_models.EventMinutesItem]
All the event minute items gathered from the event on the page
- get_events(from_dt: datetime, to_dt: datetime) List[EventIngestionModel] [source]¶
Get all city council meetings information within a specific time range.
- Parameters:
- from_dt: datetime
The start date of the time range
- to_dt: datetime
The end date of the time range
- Returns:
- list[ingestion_models.EventIngestionModel]
A list of EventIngestionModel containing all city council meeting information within the specified time range
- cdp_scrapers.instances.houston.get_houston_events(from_dt: datetime | None = None, to_dt: datetime | None = None, **kwargs: Any) List[EventIngestionModel] [source]¶
Public API for use in instances.__init__ so that this func can be attached as an attribute to cdp_scrapers.instances module. Thus the outside world like cdp-backend can get at this by asking for “get_houston_events”.
- Parameters:
- from_dt: datetime, optional
The timespan beginning datetime to query for events after. Default is 2 days from UTC now
- to_dt: datetime, optional
The timespan end datetime to query for events before. Default is UTC now
- kwargs: Any
Any extra keyword arguments to pass to the get_events function.
- Returns:
- events: List[EventIngestionModel]
See also
cdp_scrapers.instances.__init__.py
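A usage sketch relying on the documented defaults (roughly the last two days up to UTC now):

    from cdp_scrapers.instances.houston import get_houston_events

    # With no arguments, the documented default time window is used
    events = get_houston_events()
    print(f"Scraped {len(events)} Houston events")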
cdp_scrapers.instances.kingcounty module¶
- class cdp_scrapers.instances.kingcounty.KingCountyScraper[source]¶
Bases:
LegistarScraper
King County specific implementation of LegistarScraper.
- PYTHON_MUNICIPALITY_SLUG: str = 'king_county'¶
- static dump_static_info(file_path: Path) None [source]¶
Call this to save current council members' information as Persons in json format to file_path. Intended to be called once every N years when the council changes.
- Parameters:
- file_path: Path
output json file path
- static get_static_person_info() Dict[str, Person] [source]¶
Scrape current council members' information from kingcounty.gov.
- Returns:
- persons: Dict[str, Person]
keyed by name
Notes
Parse https://kingcounty.gov/council/councilmembers/find_district.aspx, which contains a list of current council members' names, positions, and contact info.
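A sketch of the documented once-every-N-years workflow; the output file name is illustrative:

    from pathlib import Path

    from cdp_scrapers.instances.kingcounty import KingCountyScraper

    # Scrape current council member info, keyed by name
    persons = KingCountyScraper.get_static_person_info()
    print(sorted(persons))

    # Save the same info as Persons in json format (file name is illustrative)
    KingCountyScraper.dump_static_info(Path("king_county_static.json"))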
cdp_scrapers.instances.lacity module¶
- class cdp_scrapers.instances.lacity.LosAngelesScraper[source]¶
Bases:
PrimeGovScraper
LA, CA specific implementation of PrimeGovScraper.
- PYTHON_MUNICIPALITY_SLUG: str = 'lacity'¶
- cdp_scrapers.instances.lacity.get_lacity_events(from_dt: datetime | None = None, to_dt: datetime | None = None, **kwargs: Any) List[EventIngestionModel] [source]¶
Public API for use in instances.__init__ so that this func can be attached as an attribute to cdp_scrapers.instances module. Thus the outside world like cdp-backend can get at this by asking for “get_lacity_events”.
- Parameters:
- from_dt: datetime, optional
The timespan beginning datetime to query for events after. Default is 2 days from UTC now
- to_dt: datetime, optional
The timespan end datetime to query for events before. Default is UTC now
- kwargs: Any
Any extra keyword arguments to pass to the get_events function.
- Returns:
- events: List[EventIngestionModel]
See also
cdp_scrapers.instances.__init__.py
cdp_scrapers.instances.portland module¶
- class cdp_scrapers.instances.portland.PortlandScraper[source]¶
Bases:
IngestionModelScraper
- get_agenda_uri(event_page: BeautifulSoup) str | None [source]¶
Find the uri for the file containing the agenda for a Portland, OR city council meeting.
- Parameters:
- event_page: BeautifulSoup
Web page for the meeting loaded as a bs4 object
- Returns:
- agenda_uri: Optional[str]
The uri for the file containing the meeting’s agenda
- get_doc_number(minute_section: Tag, event_page: BeautifulSoup) str [source]¶
Find the document number in the minute_section.
- Parameters:
- minute_section: Tag
<div> within event web page for a given event minute item
- event_page: BeautifulSoup
The entire page where the event is found
- Returns:
- doc_number: str
The document number in the minute_section. If this is null, the section top number with the year is returned.
- get_event(event_time: datetime) EventIngestionModel | None [source]¶
Portland, OR city council meeting information for a specific date.
- Parameters:
- event_time: datetime
Meeting date
- Returns:
- Optional[EventIngestionModel]
None if there was no meeting on event_time or information for the meeting did not meet minimal CDP requirements.
- get_event_minutes(event_page: BeautifulSoup) list[EventMinutesItem] | None [source]¶
Make EventMinutesItem from each relation--type-agenda-item <div> on event_page.
- Parameters:
- event_page: BeautifulSoup
Web page for the meeting loaded as a bs4 object
- Returns:
- event minute items: Optional[List[EventMinutesItem]]
- get_events(begin: datetime | None = None, end: datetime | None = None) list[EventIngestionModel] [source]¶
Portland, OR city council meeting information over given time span as List[EventIngestionModel].
- Parameters:
- begin: datetime, optional
The timespan beginning datetime to query for events after. Default is 2 days from UTC now
- end: datetime, optional
The timespan end datetime to query for events before. Default is UTC now
- Returns:
- events: List[EventIngestionModel]
References
- get_matter(minute_section: Tag, event_page: BeautifulSoup) Matter | None [source]¶
Make Matter from information in minute_section.
- Parameters:
- minute_section: Tag
<div> within event web page for a given event minute item
- event_page: BeautifulSoup
The entire page where the event is found
- Returns:
- matter: Optional[Matter]
Matter if required information could be parsed from minute_section
- get_person(name: str) Person [source]¶
Return matching Person from portland-static.json.
- Parameters:
- name: str
Person full name
- Returns:
- person: Person
Matching Person from portland-static.json
- Raises:
- KeyError
If name does not exist in portland-static.json
References
portland-static.json
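A usage sketch showing the documented KeyError behavior; constructing the scraper with no arguments and the person name are assumptions for illustration:

    from cdp_scrapers.instances.portland import PortlandScraper

    scraper = PortlandScraper()
    try:
        # The name must match an entry in portland-static.json; "Jane Doe" is illustrative
        person = scraper.get_person("Jane Doe")
        print(person.name)
    except KeyError:
        print("Name not found in portland-static.json")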
- get_section_top_number(minute_section: Tag, event_page: BeautifulSoup) str [source]¶
Find the top section number in the minute_section.
- Parameters:
- minute_section: Tag
<div> within event web page for a given event minute item
- event_page: BeautifulSoup
The entire page where the event is found
- Returns:
- doc_number: str
The top section number in the minute_section, with the year appended at the end
- get_sessions(event_page: BeautifulSoup) list[Session] | None [source]¶
Parse meeting video URIs from event_page, return Session for each video found.
- Parameters:
- event_page: BeautifulSoup
Web page for the meeting loaded as a bs4 object
- Returns:
- sessions: Optional[List[Session]]
Session for each video found on event_page
- get_supporting_files(minute_section: Tag) list[SupportingFile] | None [source]¶
Return SupportingFiles for a given EventMinutesItem.
- Parameters:
- minute_section: Tag
<div> within event web page for a given event minute item
- Returns:
- supporting files: Optional[List[SupportingFile]]
See also
Notes
Follow hyperlink to go to minutes item details page. On the details page look for directly-linked files and externally-hosted efiles.
- get_votes(minute_section: Tag) list[Vote] | None [source]¶
Look for ‘Votes:’ in minute_section and create a Vote object for each line.
- Parameters:
- minute_section: Tag
<div> within event web page for a given event minute item
- Returns:
- votes: Optional[List[Vote]]
Votes for corresponding event minute item if found
- class cdp_scrapers.instances.portland.WebPageSoup(status, soup)[source]¶
Bases:
NamedTuple
Create new instance of WebPageSoup(status, soup)
- soup: BeautifulSoup | None¶
Alias for field number 1
- status: bool¶
Alias for field number 0
- cdp_scrapers.instances.portland.disposition_to_minute_decision(disposition: str) EventMinutesItemDecision | None [source]¶
Decide EventMinutesItemDecision constant from event minute item disposition.
- Parameters:
- disposition: str
Disposition from the event web page for a given item, e.g. Passed, Continued
- Returns:
- decision: Optional[EventMinutesItemDecision]
See also
MINUTE_ITEM_PASSED_PATTERNS
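For example, using the dispositions mentioned in the parameter description:

    from cdp_scrapers.instances.portland import disposition_to_minute_decision

    # "Passed" and "Continued" are dispositions cited above
    print(disposition_to_minute_decision("Passed"))
    print(disposition_to_minute_decision("Continued"))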
- cdp_scrapers.instances.portland.get_disposition(minute_section: Tag) str [source]¶
Return disposition string given within minute_section <div> on the event web page.
- Parameters:
- minute_section: Tag
<div> within event web page for a given event minute item
- Returns:
- disposition: str
Disposition string for the event minute item e.g. Accepted, Passed, Placed on file
- cdp_scrapers.instances.portland.get_portland_events(from_dt: datetime | None = None, to_dt: datetime | None = None, **kwargs: Any) list[EventIngestionModel] [source]¶
Public API for use in instances.__init__ so that this func can be attached as an attribute to cdp_scrapers.instances module. Thus the outside world like cdp-backend can get at this by asking for “get_portland_events”.
- Parameters:
- from_dt: datetime, optional
The timespan beginning datetime to query for events after. Default is 2 days from UTC now
- to_dt: datetime, optional
The timespan end datetime to query for events before. Default is UTC now
- kwargs: Any
Any extra keyword arguments to pass to the get_events function.
- Returns:
- events: List[EventIngestionModel]
See also
cdp_scrapers.instances.__init__.py
- cdp_scrapers.instances.portland.load_web_page(url: str | Request) WebPageSoup [source]¶
Load web page at url and return content soupified.
- Parameters:
- url: str | urllib.request.Request
Web page to load
- Returns:
- result: WebPageSoup
WebPageSoup.status = False if web page at url could not be loaded
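A usage sketch checking the returned status flag; the URL is illustrative:

    from cdp_scrapers.instances.portland import load_web_page

    result = load_web_page("https://www.portland.gov/council")  # illustrative URL
    if result.status:
        print(result.soup.title)
    else:
        print("Web page could not be loaded")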
- cdp_scrapers.instances.portland.make_efile_url(efile_page_url: str) str [source]¶
Helper function to get file download link on a Portland EFile hosting web page.
- Parameters:
- efile_page_url: str
URL to Portland efile hosting web page e.g. https://efiles.portlandoregon.gov/record/14803529
- Returns:
- efile url: str
URL to the file itself e.g. https://efiles.portlandoregon.gov/record/14803529/File/Document
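For example, using the efile record URL from the parameter description:

    from cdp_scrapers.instances.portland import make_efile_url

    file_url = make_efile_url("https://efiles.portlandoregon.gov/record/14803529")
    # Expected: https://efiles.portlandoregon.gov/record/14803529/File/Document
    print(file_url)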
cdp_scrapers.instances.seattle module¶
- class cdp_scrapers.instances.seattle.SeattleScraper[source]¶
Bases:
LegistarScraper
Seattle specific implementation of LegistarScraper.
- PYTHON_MUNICIPALITY_SLUG: str = 'seattle'¶
- static dump_static_info(file_path: str) bool [source]¶
Save static data in json format.
- Parameters:
- file_path: str
Static data dump file path
- Returns:
- bool
True if some data was saved in file_path
See also
LegistarScraper.inject_known_data
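A sketch of the intended workflow; the file path is illustrative:

    from cdp_scrapers.instances.seattle import SeattleScraper

    # Save static data to json (file path is illustrative)
    saved = SeattleScraper.dump_static_info("seattle-static.json")
    print("Data saved" if saved else "No data saved")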
- get_content_uris(legistar_ev: dict) list[ContentURIs] [source]¶
Return URLs for videos and captions parsed from seattlechannel.org web page.
- Parameters:
- legistar_ev: Dict
Data for one Legistar Event.
- Returns:
- content_uris: List[ContentURIs]
List of ContentURIs objects for each session found.
See also
Notes
get_events() calls get_content_uris() to get video and caption URIs. get_content_uris() gets the video page URL from EventInSiteURL. If “videoid” is in the video page URL, it calls parse_content_uris(). Else, it calls get_video_page_urls() to get the proper video page URL with “videoid”, then calls parse_content_uris().
- get_events()
-> get_content_uris()
-> parse_content_uris(), or -> get_video_page_urls() then parse_content_uris()
- static get_person_picture_url(person_www: str) str | None [source]¶
Parse person_www and return banner image used on the web page.
- Parameters:
- person_www: str
- Returns:
- Image URL: Optional[str]
Full URL to banner image displayed on person_www
- static get_static_person_info() list[Person] | None [source]¶
Return partial Persons with static long-term information.
- Returns:
- persons: Optional[List[Person]]
- get_video_page_urls(video_list_page_url: str, event_short_date: str) list[str] [source]¶
Return URLs to web pages hosting videos for meetings from event_short_date.
- Parameters:
- video_list_page_url: str
URL to web page listing videos featuring the responsible group/body for the event described in legistar_ev. e.g. http://www.seattlechannel.org/BudgetCommittee?Mode2=Video
- event_short_date: str
the meeting’s date in m/d/yy format
- Returns:
- video_page_urls: List[str]
web page URL per video
See also
- parse_content_uris(video_page_url: str, event_short_date: str) list[ContentURIs] [source]¶
Return URLs for videos and captions parsed from seattlechannel.org web page.
- Parameters:
- video_page_url: str
URL to a web page for a particular meeting video
- event_short_date: str
the meeting’s date in m/d/yy format, used for verification
- Returns:
- content_uris: List[ContentURIs]
List of ContentURIs objects for each session found.
- Raises:
- VideoIdMismatchError
If date on the video web page does not match the event date.
See also
- static roman_to_int(roman: str)[source]¶
Convert a Roman numeral to an integer.
- Parameters:
- roman: str
Roman numeral string
- Returns:
- int
Input roman numeral as integer
References
https://www.w3resource.com/python-exercises/class-exercises/python-class-exercise-2.php
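For example:

    from cdp_scrapers.instances.seattle import SeattleScraper

    # Static method: convert Roman numeral strings to integers
    print(SeattleScraper.roman_to_int("IX"))   # 9
    print(SeattleScraper.roman_to_int("XIV"))  # 14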
Module contents¶
Individual scratchpad and maybe up-to-date CDP instance scrapers.
- cdp_scrapers.instances.get_king_county_events(from_dt: ~datetime.datetime, to_dt: ~datetime.datetime, *, legistar_scraper: ~typing.Type[~cdp_scrapers.legistar_utils.LegistarScraper] = <class 'cdp_scrapers.instances.kingcounty.KingCountyScraper'>, **kwargs: ~typing.Any) List[EventIngestionModel] ¶
- cdp_scrapers.instances.get_seattle_events(from_dt: ~datetime.datetime, to_dt: ~datetime.datetime, *, legistar_scraper: ~typing.Type[~cdp_scrapers.legistar_utils.LegistarScraper] = <class 'cdp_scrapers.instances.seattle.SeattleScraper'>, **kwargs: ~typing.Any) List[EventIngestionModel] ¶
- cdp_scrapers.instances.scraper_get_events(from_dt: ~datetime.datetime, to_dt: ~datetime.datetime, *, legistar_scraper: ~typing.Type[~cdp_scrapers.legistar_utils.LegistarScraper] = <class 'cdp_scrapers.instances.seattle.SeattleScraper'>, **kwargs: ~typing.Any) List[EventIngestionModel] ¶
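A usage sketch for these Legistar-backed convenience functions; the two-day window mirrors the defaults described elsewhere on this page and is illustrative:

    from datetime import datetime, timedelta

    from cdp_scrapers.instances import get_seattle_events

    # Gather Seattle events from an illustrative two-day window ending at UTC now
    to_dt = datetime.utcnow()
    events = get_seattle_events(from_dt=to_dt - timedelta(days=2), to_dt=to_dt)
    print(f"Scraped {len(events)} Seattle events")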