cdp_scrapers.instances package

Submodules

cdp_scrapers.instances.atlanta module

cdp_scrapers.instances.atlanta.assign_constant(driver: WebDriver, i: int, j: int, vote_decision: str, voting_list: list, body_name: str, persons: dict)[source]

Assign constants and add Vote to the ingestion models based on the vote decision.

Parameters:
driver: webdriver

webdriver of the matter page

i: int

tr[i] is the matter we are looking at

j: int

the row number of the information in a matter that we are looking at

vote_decision: str

the vote decision constant of the vote decision

voting_list: list

the list that contains vote ingestion models

body_name: str

the body name of the current meeting

persons: dict

Dict[str, ingestion_models.Person]

cdp_scrapers.instances.atlanta.convert_status_constant(decision: str) str[source]

Converts the matter result status to the existing constants.

Parameters:
decision: str

decision of the matter

Returns:
db_constants

result status constants
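
The conversion behind convert_status_constant can be sketched as a lookup table. The constant names and decision strings below are illustrative assumptions, since the real values come from cdp-backend's db_constants:

```python
# Hypothetical sketch of a status-mapping helper like convert_status_constant.
# The constants and keys are assumptions, not the actual db_constants values.
ADOPTED = "ADOPTED"
REJECTED = "REJECTED"
IN_PROGRESS = "IN_PROGRESS"

STATUS_MAP = {
    "ADOPTED": ADOPTED,
    "ACCEPTED": ADOPTED,
    "REFERRED": IN_PROGRESS,
    "FAILED": REJECTED,
}

def convert_status_constant(decision: str) -> str:
    # Normalize the scraped text, then fall back to IN_PROGRESS when unknown
    return STATUS_MAP.get(decision.strip().upper(), IN_PROGRESS)
```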

cdp_scrapers.instances.atlanta.get_date(driver: WebDriver, url: str, from_dt: datetime, to_dt: datetime) list[source]

Get a list of ingestion models for the meetings held during the selected time range.

Parameters:
driver: webdriver

empty webdriver

url: str

the url of the calendar page

from_dt:

the start date

to_dt:

the end date

Returns:
list

all the ingestion models for the selected date range

cdp_scrapers.instances.atlanta.get_events(from_dt: datetime, to_dt: datetime) list[source]

Gets the correct calendar link and feeds it to the function that gets a list of ingestion models.

Parameters:
from_dt:

the start date

to_dt:

the end date

Returns:
list

all the ingestion models for the selected date range

cdp_scrapers.instances.atlanta.get_matter_status(driver: WebDriver, i: int) Tuple[list, str][source]

Find the matter result status.

Parameters:
driver: webdriver

webdriver of the matter page

i: int

tracker used to loop the rows in the matter page

Returns:
sub_sections: list

the rows in the block under the matter for the current date

decision_constant: str

the matter decision constant

cdp_scrapers.instances.atlanta.get_new_person(name: str) Person[source]

Creates the person ingestion model for people who are not yet recorded.

Parameters:
name: str

the name of the person

Returns:
ingestion model

the person ingestion model for the newly appeared person

cdp_scrapers.instances.atlanta.get_person() dict[source]

Put the information returned by get_single_person() into a dictionary.

Returns:
dictionary

key: person’s name; value: person’s ingestion model

cdp_scrapers.instances.atlanta.get_single_person(driver: WebDriver, member_name: str) Person[source]

Get all the information for one person. Includes: role, seat, picture, phone, and email.

Parameters:
driver:

webdriver calling the people’s dictionary page

member_name:

person’s name

Returns:
ingestion_models

the ingestion model for the person’s part

cdp_scrapers.instances.atlanta.get_voting_result(driver: WebDriver, sub_sections_len: int, i: int, body_name: str, persons: dict) list[source]

Scrapes the voting decisions and converts them to the existing constants.

Parameters:
driver:webdriver

webdriver of the matter page

sub_sections_len: int

the row number in the block under the matter for the current date

i: int

tr[i] is the matter we are looking at

body_name: str

the body name of the current meeting

persons: dict

Dict[str, ingestion_models.Person]

Returns:
list

contains the Vote ingestion model for each person

cdp_scrapers.instances.atlanta.get_year(driver: WebDriver, url: str, from_dt: datetime) str[source]

Navigate to the year that we are looking for.

Parameters:
driver: webdriver

empty webdriver

url: str

the url of the calendar page

from_dt: datetime

the datetime object for the search target year

Returns:
link:str

the link to the calendar of the year that we are looking for

cdp_scrapers.instances.atlanta.parse_event(url: str) EventIngestionModel[source]

Scrapes all the information for a meeting.

Parameters:
url: str

the url of the meeting that we want to scrape

Returns:
ingestion model

the ingestion model for the meeting

cdp_scrapers.instances.atlanta.parse_single_matter(driver: WebDriver, test: str, item: str, body_name: str, s_word_formated: datetime, persons: dict) EventMinutesItem[source]

Get the minutes items that contain a matter.

Parameters:
driver: webdriver

webdriver of the matter page

matter: element

the matter we are looking at

body_name: str

the body name of the current meeting

s_word_formated: datetime

the date of the current meeting

persons: dict

Dict[str, ingestion_models.Person]

Returns:
ingestion model

minutes ingestion model with the matter’s information

cdp_scrapers.instances.empty module

cdp_scrapers.instances.empty.get_events(from_dt: datetime, to_dt: datetime, **kwargs: Any) List[EventIngestionModel][source]

Get all events for the provided timespan.

Parameters:
from_dt: datetime

Datetime to start event gather from.

to_dt: datetime

Datetime to end event gather at.

kwargs: Any

Any keyword arguments to provide to downstream functions.

Returns:
events: List[EventIngestionModel]

All events gathered that occurred in the provided time range.

Notes

As the implementer of the get_events function, you can choose to ignore the from_dt and to_dt parameters. However, they are useful for manually kicking off pipelines from the GitHub Actions UI.
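
The default-timespan convention used throughout this package (a window of 2 days back from UTC now) can be sketched as follows; resolve_timespan is a hypothetical helper name, not part of cdp_scrapers:

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

def resolve_timespan(
    from_dt: Optional[datetime] = None, to_dt: Optional[datetime] = None
) -> Tuple[datetime, datetime]:
    # Fill in the default gather window: (UTC now - 2 days, UTC now)
    to_dt = to_dt or datetime.utcnow()
    from_dt = from_dt or to_dt - timedelta(days=2)
    return from_dt, to_dt
```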

cdp_scrapers.instances.houston module

class cdp_scrapers.instances.houston.AgendaType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

Pdf = 2
WebPage = 1

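
As a minimal sketch, an IntEnum like AgendaType lets the scraper branch on the agenda's resource type by name rather than by magic number; the values match the docs above, while the describe helper is hypothetical:

```python
from enum import IntEnum

# Sketch of an agenda-type enum like houston.AgendaType (values from the docs)
class AgendaType(IntEnum):
    WebPage = 1
    Pdf = 2

def describe(agenda_type: AgendaType) -> str:
    # Hypothetical helper: dispatch on the enum member, not the raw int
    if agenda_type is AgendaType.Pdf:
        return "parse agenda from a PDF"
    return "parse agenda from an HTML page"
```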
class cdp_scrapers.instances.houston.HoustonScraper[source]

Bases: IngestionModelScraper

get_agenda(element: Tag) Tag | NavigableString | None[source]

Get event agenda for a specific details page.

Parameters:
element: Tag

The element from which we want to get the agenda

Returns:
AgendaType, Tag

Resource type for the agenda and the agenda resource itself

get_all_elements_in_range(time_from: datetime, time_to: datetime) Dict[str, Tag][source]

Get all the meetings in a range of dates.

Parameters:
time_from: datetime

Earliest meeting date to look at

time_to: datetime

Latest meeting date to look at

Returns:
Dict[str, Tag]

Dictionary mapping the date of the meeting to the element for the meeting on that date

get_body_name(event: Tag | NavigableString | None) str[source]

Get the body name for an event.

Parameters:
event: Union[Tag, NavigableString, None]

All elements in the page that we want to scrape

Returns:
str

The body name

Find the main link for one event.

Parameters:
element: Tag

The element of one event

Returns:
str

The main link for this event

get_diff_yearid(event_date: datetime) str[source]

Get the id of the year tab where the event is stored; events for different years are stored in different tabs, so this enables gathering events across multiple years.

Parameters:
event_date: datetime

The date of the event we are trying to parse

Returns:
str

The year id that can locate the year tab where the event is stored

get_event(date: str, element: Tag) EventIngestionModel[source]

Parse one event: city council meeting information for a specific date.

Parameters:
date: str

the date of this meeting

element: Tag

the meeting Tag element

Returns:
ingestion_models.EventIngestionModel

EventIngestionModel for one meeting date

get_event_minutes_item(event: Tag | NavigableString | None) List[EventMinutesItem][source]

Parse the page and gather the event minute items.

Parameters:
event: Union[Tag, NavigableString, None]

All elements in the page that we want to scrape

Returns:
List[ingestion_models.EventMinutesItem]

All the event minute items gathered from the event on the page

get_events(from_dt: datetime, to_dt: datetime) List[EventIngestionModel][source]

Get all city council meetings information within a specific time range.

Parameters:
from_dt: datetime

The start date of the time range

to_dt: datetime

The end date of the time range

Returns:
list[ingestion_models.EventIngestionModel]

A list of EventIngestionModel that contains all city council meetings information within a specific time range

remove_extra_type(element: Tag | NavigableString | None) Tag[source]

Remove types that are not useful.

Parameters:
element: Union[Tag, NavigableString, None]

The element in the page that we want to scrape

Returns:
Tag

Same elements as received, assuming the elements are not null

cdp_scrapers.instances.houston.get_houston_events(from_dt: datetime | None = None, to_dt: datetime | None = None, **kwargs: Any) List[EventIngestionModel][source]

Public API for use in instances.__init__ so that this func can be attached as an attribute to cdp_scrapers.instances module. Thus the outside world like cdp-backend can get at this by asking for “get_houston_events”.

Parameters:
from_dt: datetime, optional

The timespan beginning datetime to query for events after. Default is 2 days from UTC now

to_dt: datetime, optional

The timespan end datetime to query for events before. Default is UTC now

kwargs: Any

Any extra keyword arguments to pass to the get_events function.

Returns:
events: List[EventIngestionModel]

See also

cdp_scrapers.instances.__init__.py

cdp_scrapers.instances.kingcounty module

class cdp_scrapers.instances.kingcounty.KingCountyScraper[source]

Bases: LegistarScraper

King County specific implementation of LegistarScraper.

PYTHON_MUNICIPALITY_SLUG: str = 'king_county'

static dump_static_info(file_path: Path) None[source]

Call this to save current council members information as Persons in json format to file_path. Intended to be called once every N years when the council changes.

Parameters:
file_path: Path

output json file path

static get_static_person_info() Dict[str, Person][source]

Scrape current council members information from kingcounty.gov.

Returns:
persons: Dict[str, Person]

keyed by name

Notes

Parse https://kingcounty.gov/council/councilmembers/find_district.aspx, which contains the current council members’ names, positions, and contact info.

cdp_scrapers.instances.lacity module

class cdp_scrapers.instances.lacity.LosAngelesScraper[source]

Bases: PrimeGovScraper

LA, CA specific implementation of PrimeGovScraper.

PYTHON_MUNICIPALITY_SLUG: str = 'lacity'

cdp_scrapers.instances.lacity.get_lacity_events(from_dt: datetime | None = None, to_dt: datetime | None = None, **kwargs: Any) List[EventIngestionModel][source]

Public API for use in instances.__init__ so that this func can be attached as an attribute to cdp_scrapers.instances module. Thus the outside world like cdp-backend can get at this by asking for “get_lacity_events”.

Parameters:
from_dt: datetime, optional

The timespan beginning datetime to query for events after. Default is 2 days from UTC now

to_dt: datetime, optional

The timespan end datetime to query for events before. Default is UTC now

kwargs: Any

Any extra keyword arguments to pass to the get_events function.

Returns:
events: List[EventIngestionModel]

See also

cdp_scrapers.instances.__init__.py

cdp_scrapers.instances.portland module

class cdp_scrapers.instances.portland.PortlandScraper[source]

Bases: IngestionModelScraper

get_agenda_uri(event_page: BeautifulSoup) str | None[source]

Find the uri for the file containing the agenda for a Portland, OR city council meeting.

Parameters:
event_page: BeautifulSoup

Web page for the meeting loaded as a bs4 object

Returns:
agenda_uri: Optional[str]

The uri for the file containing the meeting’s agenda

get_doc_number(minute_section: Tag, event_page: BeautifulSoup) str[source]

Find the document number in the minute_section.

Parameters:
minute_section: Tag

<div> within event web page for a given event minute item

event_page: BeautifulSoup

The entire page where the event is found

Returns:
doc_number: str

The document number in the minute_section. If this is null, the section top number with the year is returned.

get_event(event_time: datetime) EventIngestionModel | None[source]

Portland, OR city council meeting information for a specific date.

Parameters:
event_time: datetime

Meeting date

Returns:
Optional[EventIngestionModel]

None if there was no meeting on event_time or information for the meeting did not meet minimal CDP requirements.

get_event_minutes(event_page: BeautifulSoup) list[EventMinutesItem] | None[source]

Make EventMinutesItem from each relation-type-agenda-item <div> on event_page.

Parameters:
event_page: BeautifulSoup

Web page for the meeting loaded as a bs4 object

Returns:
event minute items: Optional[List[EventMinutesItem]]

get_events(begin: datetime | None = None, end: datetime | None = None) list[EventIngestionModel][source]

Portland, OR city council meeting information over given time span as List[EventIngestionModel].

Parameters:
begin: datetime, optional

The timespan beginning datetime to query for events after. Default is 2 days from UTC now

end: datetime, optional

The timespan end datetime to query for events before. Default is UTC now

Returns:
events: List[EventIngestionModel]

References

https://www.portland.gov/council/agenda/all

get_matter(minute_section: Tag, event_page: BeautifulSoup) Matter | None[source]

Make Matter from information in minute_section.

Parameters:
minute_section: Tag

<div> within event web page for a given event minute item

event_page: BeautifulSoup

The entire page where the event is found

Returns:
matter: Optional[Matter]

Matter if required information could be parsed from minute_section

get_person(name: str) Person[source]

Return matching Person from portland-static.json.

Parameters:
name: str

Person full name

Returns:
person: Person

Matching Person from portland-static.json

Raises:
KeyError

If name does not exist in portland-static.json

References

portland-static.json

get_section_top_number(minute_section: Tag, event_page: BeautifulSoup) str[source]

Find the top section number in the minute_section.

Parameters:
minute_section: Tag

<div> within event web page for a given event minute item

event_page: BeautifulSoup

The entire page where the event is found

Returns:
doc_number: str

The top section number in the minute_section, with the year appended at the end

get_sessions(event_page: BeautifulSoup) list[Session] | None[source]

Parse meeting video URIs from event_page, return Session for each video found.

Parameters:
event_page: BeautifulSoup

Web page for the meeting loaded as a bs4 object

Returns:
sessions: Optional[List[Session]]

Session for each video found on event_page

get_supporting_files(minute_section: Tag) list[SupportingFile] | None[source]

Return SupportingFiles for a given EventMinutesItem.

Parameters:
minute_section: Tag

<div> within event web page for a given event minute item

Returns:
supporting files: Optional[List[SupportingFile]]

See also

make_efile_url

Notes

Follow hyperlink to go to minutes item details page. On the details page look for directly-linked files and externally-hosted efiles.

get_votes(minute_section: Tag) list[Vote] | None[source]

Look for ‘Votes:’ in minute_section and create a Vote object for each line.

Parameters:
minute_section: Tag

<div> within event web page for a given event minute item

Returns:
votes: Optional[List[Vote]]

Votes for corresponding event minute item if found

class cdp_scrapers.instances.portland.WebPageSoup(status, soup)[source]

Bases: NamedTuple

Create new instance of WebPageSoup(status, soup)

soup: BeautifulSoup | None

Alias for field number 1

status: bool

Alias for field number 0
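
A minimal sketch of this result type as a NamedTuple; Optional[object] stands in for Optional[BeautifulSoup] so the example needs no bs4 install:

```python
from typing import NamedTuple, Optional

# Sketch of WebPageSoup: field 0 is the load status, field 1 the parsed soup
class WebPageSoup(NamedTuple):
    status: bool  # False when the page could not be loaded
    soup: Optional[object] = None  # stand-in for Optional[BeautifulSoup]

failed = WebPageSoup(False)
```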

cdp_scrapers.instances.portland.disposition_to_minute_decision(disposition: str) EventMinutesItemDecision | None[source]

Decide EventMinutesItemDecision constant from event minute item disposition.

Parameters:
disposition: str

Disposition from the event web page for a given item, e.g. Passed, Continued

Returns:
decision: Optional[EventMinutesItemDecision]

See also

MINUTE_ITEM_PASSED_PATTERNS

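
A hedged sketch of the pattern-matching idea: the pattern lists and decision strings below are assumptions standing in for MINUTE_ITEM_PASSED_PATTERNS and the real EventMinutesItemDecision constants.

```python
import re
from typing import Optional

# Illustrative constants; the real ones come from cdp-backend
PASSED = "Passed"
FAILED = "Failed"
PASSED_PATTERNS = [r"passed", r"adopted", r"accepted", r"confirmed"]
FAILED_PATTERNS = [r"failed", r"rejected"]

def disposition_to_minute_decision(disposition: str) -> Optional[str]:
    text = disposition.lower()
    if any(re.search(p, text) for p in PASSED_PATTERNS):
        return PASSED
    if any(re.search(p, text) for p in FAILED_PATTERNS):
        return FAILED
    return None  # e.g. "Continued" maps to no decision
```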
cdp_scrapers.instances.portland.get_disposition(minute_section: Tag) str[source]

Return disposition string given within minute_section <div> on the event web page.

Parameters:
minute_section: Tag

<div> within event web page for a given event minute item

Returns:
disposition: str

Disposition string for the event minute item e.g. Accepted, Passed, Placed on file

cdp_scrapers.instances.portland.get_portland_events(from_dt: datetime | None = None, to_dt: datetime | None = None, **kwargs: Any) list[EventIngestionModel][source]

Public API for use in instances.__init__ so that this func can be attached as an attribute to cdp_scrapers.instances module. Thus the outside world like cdp-backend can get at this by asking for “get_portland_events”.

Parameters:
from_dt: datetime, optional

The timespan beginning datetime to query for events after. Default is 2 days from UTC now

to_dt: datetime, optional

The timespan end datetime to query for events before. Default is UTC now

kwargs: Any

Any extra keyword arguments to pass to the get_events function.

Returns:
events: List[EventIngestionModel]

See also

cdp_scrapers.instances.__init__.py

cdp_scrapers.instances.portland.load_web_page(url: str | Request) WebPageSoup[source]

Load web page at url and return content soupified.

Parameters:
url: str | urllib.request.Request

Web page to load

Returns:
result: WebPageSoup

WebPageSoup.status = False if web page at url could not be loaded

cdp_scrapers.instances.portland.make_efile_url(efile_page_url: str) str[source]

Helper function to get file download link on a Portland EFile hosting web page.

Parameters:
efile_page_url: str

URL to Portland efile hosting web page e.g. https://efiles.portlandoregon.gov/record/14803529

Returns:
efile url: str

URL to the file itself e.g. https://efiles.portlandoregon.gov/record/14803529/File/Document
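
Based only on the example URLs above, the transformation can be sketched as appending /File/Document to the record URL; the real helper may additionally load the hosting page:

```python
# Minimal sketch of make_efile_url from the documented example URLs
def make_efile_url(efile_page_url: str) -> str:
    # e.g. .../record/14803529 -> .../record/14803529/File/Document
    return efile_page_url.rstrip("/") + "/File/Document"
```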

cdp_scrapers.instances.portland.separate_name_from_title(title_and_name: str) str[source]

Return just name.

Parameters:
title_and_name: str

e.g. Mayor Ted Wheeler

Returns:
name: str

title_and_name with all title-related words removed, e.g. Ted Wheeler
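
One way to sketch this is a regex that strips a known list of leading titles; the TITLES tuple is an assumption, not the library's actual list:

```python
import re

# Hypothetical title list; the real implementation may recognize more words
TITLES = ("Mayor", "Commissioner", "Councilor", "Auditor")

def separate_name_from_title(title_and_name: str) -> str:
    # Strip one leading title word plus the whitespace after it
    pattern = r"^(?:%s)\s+" % "|".join(TITLES)
    return re.sub(pattern, "", title_and_name.strip())
```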

cdp_scrapers.instances.seattle module

class cdp_scrapers.instances.seattle.SeattleScraper[source]

Bases: LegistarScraper

Seattle specific implementation of LegistarScraper.

PYTHON_MUNICIPALITY_SLUG: str = 'seattle'

static dump_static_info(file_path: str) bool[source]

Save static data in json format.

Parameters:
file_path: str

Static data dump file path

Returns:
bool

True if some data was saved in file_path

See also

LegistarScraper.inject_known_data

get_content_uris(legistar_ev: dict) list[ContentURIs][source]

Return URLs for videos and captions parsed from seattlechannel.org web page.

Parameters:
legistar_ev: Dict

Data for one Legistar Event.

Returns:
content_uris: List[ContentURIs]

List of ContentURIs objects for each session found.

Notes

get_events() calls get_content_uris() to get video and caption URIs. get_content_uris() gets the video page URL from EventInSiteURL. If “videoid” is in the video page URL, it calls parse_content_uris(); otherwise it first calls get_video_page_urls() to get the proper video page URL with “videoid”, then calls parse_content_uris().

get_events()
-> get_content_uris()
-> parse_content_uris() or -> get_video_page_urls(), parse_content_uris()

static get_person_picture_url(person_www: str) str | None[source]

Parse person_www and return banner image used on the web page.

Parameters:
person_www: str

e.g. http://www.seattle.gov/council/pedersen

Returns:
Image URL: Optional[str]

Full URL to banner image displayed on person_www

static get_static_person_info() list[Person] | None[source]

Return partial Persons with static long-term information.

Returns:
persons: Optional[List[Person]]

get_video_page_urls(video_list_page_url: str, event_short_date: str) list[str][source]

Return URLs to web pages hosting videos for meetings from event_short_date.

Parameters:
video_list_page_url: str

URL to web page listing videos featuring the responsible group/body for the event described in legistar_ev. e.g. http://www.seattlechannel.org/BudgetCommittee?Mode2=Video

event_short_date: str

the meeting’s date as a string, m/d/yy

Returns:
video_page_urls: List[str]

web page URL per video

See also

get_content_uris

parse_content_uris(video_page_url: str, event_short_date: str) list[ContentURIs][source]

Return URLs for videos and captions parsed from seattlechannel.org web page.

Parameters:
video_page_url: str

URL to a web page for a particular meeting video

event_short_date: str

the meeting’s date as a string, m/d/yy; used for verification

Returns:
content_uris: List[ContentURIs]

List of ContentURIs objects for each session found.

Raises:
VideoIdMismatchError

If date on the video web page does not match the event date.

See also

get_content_uris
static roman_to_int(roman: str)[source]

Convert a Roman numeral to an integer.

Parameters:
roman: str

Roman numeral string

Returns:
int

Input roman numeral as integer

References

https://www.w3resource.com/python-exercises/class-exercises/python-class-exercise-2.php
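
A common implementation of this conversion, shown as a sketch (the actual method may differ):

```python
# Map each numeral to its value; subtract when a smaller numeral precedes
# a larger one (e.g. IV = 4), otherwise add.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(roman: str) -> int:
    roman = roman.upper()
    total = 0
    for i, ch in enumerate(roman):
        value = ROMAN_VALUES[ch]
        if i + 1 < len(roman) and ROMAN_VALUES[roman[i + 1]] > value:
            total -= value
        else:
            total += value
    return total
```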

exception cdp_scrapers.instances.seattle.VideoIdMismatchError[source]

Bases: ValueError

Module contents

Individual scratchpad and maybe up-to-date CDP instance scrapers.

cdp_scrapers.instances.get_king_county_events(from_dt: datetime, to_dt: datetime, *, legistar_scraper: Type[LegistarScraper] = KingCountyScraper, **kwargs: Any) List[EventIngestionModel]

cdp_scrapers.instances.get_seattle_events(from_dt: datetime, to_dt: datetime, *, legistar_scraper: Type[LegistarScraper] = SeattleScraper, **kwargs: Any) List[EventIngestionModel]

cdp_scrapers.instances.scraper_get_events(from_dt: datetime, to_dt: datetime, *, legistar_scraper: Type[LegistarScraper] = SeattleScraper, **kwargs: Any) List[EventIngestionModel]
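
The keyword-only legistar_scraper parameter above suggests a dispatch pattern: one generic entry point takes the scraper class as an argument, and per-city wrappers pin the default. A simplified sketch with stand-in classes (the real LegistarScraper does far more):

```python
from datetime import datetime
from typing import Any, List, Type

# Stand-in base class; the real one lives in cdp_scrapers.legistar_utils
class LegistarScraper:
    PYTHON_MUNICIPALITY_SLUG = ""

    def get_events(self, from_dt: datetime, to_dt: datetime, **kwargs: Any) -> List[str]:
        # Placeholder result; the real method returns EventIngestionModels
        return [self.PYTHON_MUNICIPALITY_SLUG]

class SeattleScraper(LegistarScraper):
    PYTHON_MUNICIPALITY_SLUG = "seattle"

def scraper_get_events(
    from_dt: datetime,
    to_dt: datetime,
    *,
    legistar_scraper: Type[LegistarScraper] = SeattleScraper,
    **kwargs: Any,
) -> List[str]:
    # Instantiate whichever scraper class the caller (or the default) names
    return legistar_scraper().get_events(from_dt, to_dt, **kwargs)
```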