cdp_scrapers.instances package¶
Submodules¶
cdp_scrapers.instances.atlanta module¶
- cdp_scrapers.instances.atlanta.assign_constant(driver: WebDriver, i: int, j: int, vote_decision: str, voting_list: list, body_name: str, persons: dict)[source]¶
Assign constants and add Vote to the ingestion models based on the vote decision.
- Parameters:
- driver: WebDriver
webdriver of the matter page
- i: int
tr[i] is the matter we are looking at
- j: int
the row number of the information in a matter that we are looking at
- vote_decision: str
the vote decision constant of the vote decision
- voting_list: list
the list that contains vote ingestion models
- body_name: str
the body name of the current meeting
- persons: dict
Dict[str, ingestion_models.Person]
- cdp_scrapers.instances.atlanta.convert_status_constant(decision: str) str [source]¶
Converts the matter result status to the existing constants.
- Parameters:
- decision: str
decision of the matter
- Returns:
- db_constants
result status constants
- cdp_scrapers.instances.atlanta.get_date(driver: WebDriver, url: str, from_dt: datetime, to_dt: datetime) list [source]¶
Get a list of ingestion models for the meetings held during the selected time range.
- Parameters:
- driver: WebDriver
empty webdriver
- url: str
the url of the calendar page
- from_dt:
the start date
- to_dt:
the end date
- Returns:
- list
all the ingestion models for the selected date range
- cdp_scrapers.instances.atlanta.get_events(from_dt: datetime, to_dt: datetime) list [source]¶
Gets the right calendar link and feeds it to the function that gets a list of ingestion models (see the usage sketch after this entry).
- Parameters:
- from_dt:
the start date
- to_dt:
the end date
- Returns:
- list
all the ingestion models for the selected date range
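A minimal usage sketch using only the documented signature; the date range is illustrative, and running it requires the Selenium webdriver setup this module relies on:

    from datetime import datetime

    from cdp_scrapers.instances.atlanta import get_events

    # Date range is illustrative; any from_dt/to_dt pair works
    events = get_events(from_dt=datetime(2023, 1, 1), to_dt=datetime(2023, 1, 8))
    print(f"Scraped {len(events)} Atlanta events")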
- cdp_scrapers.instances.atlanta.get_matter_status(driver: WebDriver, i: int) Tuple[list, str] [source]¶
Find the matter result status.
- Parameters:
- driver: WebDriver
webdriver of the matter page
- i: int
tracker used to loop the rows in the matter page
- Returns:
- sub_sections: element
the block under the matter for the current date
- decision_constant: element
the matter decision constant
- cdp_scrapers.instances.atlanta.get_new_person(name: str) Person [source]¶
Creates the person ingestion model for the people that are not recorded.
- Parameters:
- name: str
the name of the person
- Returns:
- ingestion model
the person ingestion model for the newly appeared person
- cdp_scrapers.instances.atlanta.get_person() dict [source]¶
Put the information gathered by get_single_person() into a dictionary.
- Returns:
- dictionary
key: person’s name; value: person’s ingestion model
- cdp_scrapers.instances.atlanta.get_single_person(driver: WebDriver, member_name: str) Person [source]¶
Get all the information for one person. Includes: role, seat, picture, phone, email.
- Parameters:
- driver:
webdriver calling the people’s dictionary page
- member_name:
person’s name
- Returns:
- ingestion_models
the ingestion model for the person
- cdp_scrapers.instances.atlanta.get_voting_result(driver: WebDriver, sub_sections_len: int, i: int, body_name: str, persons: dict) list [source]¶
Scrapes and converts the voting decisions to the existing constants.
- Parameters:
- driver: WebDriver
webdriver of the matter page
- sub_sections_len: int
the row number in the block under the matter for the current date
- i: int
tr[i] is the matter we are looking at
- body_name: str
the body name of the current meeting
- persons: dict
Dict[str, ingestion_models.Person]
- Returns:
- list
contains the Vote ingestion model for each person
- cdp_scrapers.instances.atlanta.get_year(driver: WebDriver, url: str, from_dt: datetime) str [source]¶
Navigate to the year that we are looking for.
- Parameters:
- driver: WebDriver
empty webdriver
- url: str
the url of the calendar page
- from_dt: datetime
the datetime object for the search target year
- Returns:
- link:str
the link to the calendar for the year that we are looking for
- cdp_scrapers.instances.atlanta.parse_event(url: str) EventIngestionModel [source]¶
Scrapes all the information for a meeting.
- Parameters:
- url: str
the url of the meeting that we want to scrape
- Returns:
- ingestion model
the ingestion model for the meeting
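A minimal usage sketch; the meeting URL below is a placeholder, not a real Atlanta meeting page:

    from cdp_scrapers.instances.atlanta import parse_event

    # Placeholder URL; substitute a real Atlanta meeting detail page
    meeting_url = "https://example.com/atlanta-meeting-detail"
    event = parse_event(meeting_url)
    print(type(event).__name__)  # EventIngestionModel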
- cdp_scrapers.instances.atlanta.parse_single_matter(driver: WebDriver, test: str, item: str, body_name: str, s_word_formated: datetime, persons: dict) EventMinutesItem [source]¶
Get the minutes item that contains a matter.
- Parameters:
- driver: WebDriver
webdriver of the matter page
- matter: element
the matter we are looking at
- body_name: str
the body name of the current meeting
- s_word_formated: datetime
the date of the current meeting
- persons: dict
Dict[str, ingestion_models.Person]
- Returns:
- ingestion model
minutes ingestion model with the matters information
cdp_scrapers.instances.empty module¶
- cdp_scrapers.instances.empty.get_events(from_dt: datetime, to_dt: datetime, **kwargs: Any) List[EventIngestionModel] [source]¶
Get all events for the provided timespan.
- Parameters:
- from_dt: datetime
Datetime to start event gather from.
- to_dt: datetime
Datetime to end event gather at.
- kwargs: Any
Any keyword arguments to provide to downstream functions.
- Returns:
- events: List[EventIngestionModel]
All events gathered that occurred in the provided time range.
Notes
As the implementer of the get_events function, you can choose to ignore the from_dt and to_dt parameters. However, they are useful for manually kicking off pipelines from the GitHub Actions UI.
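For illustration, a minimal implementation sketch under these notes; it assumes the cdp_backend ingestion model import path used elsewhere by these scrapers, and the body simply returns no events:

    from datetime import datetime
    from typing import Any, List

    from cdp_backend.pipeline.ingestion_models import EventIngestionModel


    def get_events(
        from_dt: datetime, to_dt: datetime, **kwargs: Any
    ) -> List[EventIngestionModel]:
        # A not-yet-implemented scraper may return no events;
        # from_dt and to_dt can be ignored, as noted above.
        return []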
cdp_scrapers.instances.houston module¶
- class cdp_scrapers.instances.houston.AgendaType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
IntEnum
- Pdf = 2¶
- WebPage = 1¶
- class cdp_scrapers.instances.houston.HoustonScraper[source]¶
Bases:
IngestionModelScraper
- get_agenda(element: Tag) Tag | NavigableString | None [source]¶
Get event agenda for a specific details page.
- Parameters:
- element: Tag
The element from which we want to get agenda
- Returns:
- AgendaType, Tag
Resource type for the agenda and the agenda resource itself
- get_all_elements_in_range(time_from: datetime, time_to: datetime) Dict[str, Tag] [source]¶
Get all the meetings in a range of dates.
- Parameters:
- time_from: datetime
Earliest meeting date to look at
- time_to: datetime
Latest meeting date to look at
- Returns:
- Dict[str, Tag]
Dictionary mapping the date of the meeting to the element for the meeting on that date
- get_body_name(event: Tag | NavigableString | None) str [source]¶
Get the body name for an event.
- Parameters:
- event: Union[Tag, NavigableString, None]
All elements in the page that we want to scrape
- Returns:
- str
The body name
- get_date_mainlink(element: Tag) str [source]¶
Find the main link for one event.
- Parameters:
- element: Tag
The element of one event
- Returns:
- str
The main link for this event
- get_diff_yearid(event_date: datetime) str [source]¶
Get the id of the year tab where the event is stored. Events for different years are stored in different tabs, so this makes it possible to gather events across years.
- Parameters:
- event_date: datetime
The date of the event we are trying to parse
- Returns:
- str
The year id that can locate the year tab where the event is stored
- get_event(date: str, element: Tag) EventIngestionModel [source]¶
Parse one event at a specific date, returning the city council meeting information for that date.
- Parameters:
- date: str
the date of this meeting
- element: Tag
the meeting Tag element
- Returns:
- ingestion_models.EventIngestionModel
EventIngestionModel for one meeting date
- get_event_minutes_item(event: Tag | NavigableString | None) List[EventMinutesItem] [source]¶
Parse the page and gather the event minute items.
- Parameters:
- event: Union[Tag, NavigableString, None]
All elements in the page that we want to scrape
- Returns:
- List[ingestion_models.EventMinutesItem]
All the event minute items gathered from the event on the page
- get_events(from_dt: datetime, to_dt: datetime) List[EventIngestionModel] [source]¶
Get all city council meetings information within a specific time range.
- Parameters:
- from_dt: datetime
The start date of the time range
- to_dt: datetime
The end date of the time range
- Returns:
- list[ingestion_models.EventIngestionModel]
A list of EventIngestionModel containing all city council meeting information within the specified time range
- cdp_scrapers.instances.houston.get_houston_events(from_dt: datetime | None = None, to_dt: datetime | None = None, **kwargs: Any) List[EventIngestionModel] [source]¶
Public API for use in instances.__init__ so that this func can be attached as an attribute to cdp_scrapers.instances module. Thus the outside world like cdp-backend can get at this by asking for “get_houston_events”.
- Parameters:
- from_dt: datetime, optional
The timespan beginning datetime to query for events after. Default is 2 days from UTC now
- to_dt: datetime, optional
The timespan end datetime to query for events before. Default is UTC now
- kwargs: Any
Any extra keyword arguments to pass to the get_events function.
- Returns:
- events: List[EventIngestionModel]
See also
cdp_scrapers.instances.__init__.py
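A usage sketch relying on the documented defaults (roughly the last two days up to UTC now):

    from cdp_scrapers.instances.houston import get_houston_events

    # With no arguments, the documented default time window is used
    events = get_houston_events()
    print(f"Scraped {len(events)} Houston events")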
cdp_scrapers.instances.kingcounty module¶
- class cdp_scrapers.instances.kingcounty.KingCountyScraper[source]¶
Bases:
LegistarScraper
King County specific implementation of LegistarScraper.
- PYTHON_MUNICIPALITY_SLUG: str = 'king_county'¶
- static dump_static_info(file_path: Path) None [source]¶
Call this to save current council members' information as Persons in json format to file_path. Intended to be called once every N years when the council changes.
- Parameters:
- file_path: Path
output json file path
- static get_static_person_info() Dict[str, Person] [source]¶
Scrape current council members' information from kingcounty.gov.
- Returns:
- persons: Dict[str, Person]
keyed by name
Notes
Parse https://kingcounty.gov/council/councilmembers/find_district.aspx, which contains a list of current council members' names, positions, and contact info.
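A sketch of the documented once-every-N-years workflow; the output file name is illustrative:

    from pathlib import Path

    from cdp_scrapers.instances.kingcounty import KingCountyScraper

    # Scrape current council member info, keyed by name
    persons = KingCountyScraper.get_static_person_info()
    print(sorted(persons))

    # Save the same info as Persons in json format (file name is illustrative)
    KingCountyScraper.dump_static_info(Path("king_county_static.json"))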
cdp_scrapers.instances.lacity module¶
- class cdp_scrapers.instances.lacity.LosAngelesScraper[source]¶
Bases:
PrimeGovScraper
LA, CA specific implementation of PrimeGovScraper.
- PYTHON_MUNICIPALITY_SLUG: str = 'lacity'¶
- cdp_scrapers.instances.lacity.get_lacity_events(from_dt: datetime | None = None, to_dt: datetime | None = None, **kwargs: Any) List[EventIngestionModel] [source]¶
Public API for use in instances.__init__ so that this func can be attached as an attribute to cdp_scrapers.instances module. Thus the outside world like cdp-backend can get at this by asking for “get_lacity_events”.
- Parameters:
- from_dt: datetime, optional
The timespan beginning datetime to query for events after. Default is 2 days from UTC now
- to_dt: datetime, optional
The timespan end datetime to query for events before. Default is UTC now
- kwargs: Any
Any extra keyword arguments to pass to the get_events function.
- Returns:
- events: List[EventIngestionModel]
See also
cdp_scrapers.instances.__init__.py
cdp_scrapers.instances.portland module¶
- class cdp_scrapers.instances.portland.PortlandScraper[source]¶
Bases:
IngestionModelScraper
- get_agenda_uri(event_page: BeautifulSoup) str | None [source]¶
Find the uri for the file containing the agenda for a Portland, OR city council meeting.
- Parameters:
- event_page: BeautifulSoup
Web page for the meeting loaded as a bs4 object
- Returns:
- agenda_uri: Optional[str]
The uri for the file containing the meeting’s agenda
- get_doc_number(minute_section: Tag, event_page: BeautifulSoup) str [source]¶
Find the document number in the minute_section.
- Parameters:
- minute_section: Tag
<div> within event web page for a given event minute item
- event_page: BeautifulSoup
The entire page where the event is found
- Returns:
- doc_number: str
The document number in the minute_section. If this is null, the section top number with the year is returned.
- get_event(event_time: datetime) EventIngestionModel | None [source]¶
Portland, OR city council meeting information for a specific date.
- Parameters:
- event_time: datetime
Meeting date
- Returns:
- Optional[EventIngestionModel]
None if there was no meeting on event_time or information for the meeting did not meet minimal CDP requirements.
- get_event_minutes(event_page: BeautifulSoup) list[EventMinutesItem] | None [source]¶
Make EventMinutesItem from each relation--type-agenda-item <div> on event_page.
- Parameters:
- event_page: BeautifulSoup
Web page for the meeting loaded as a bs4 object
- Returns:
- event minute items: Optional[List[EventMinutesItem]]
- get_events(begin: datetime | None = None, end: datetime | None = None) list[EventIngestionModel] [source]¶
Portland, OR city council meeting information over given time span as List[EventIngestionModel].
- Parameters:
- begin: datetime, optional
The timespan beginning datetime to query for events after. Default is 2 days from UTC now
- end: datetime, optional
The timespan end datetime to query for events before. Default is UTC now
- Returns:
- events: List[EventIngestionModel]
References
- get_matter(minute_section: Tag, event_page: BeautifulSoup) Matter | None [source]¶
Make Matter from information in minute_section.
- Parameters:
- minute_section: Tag
<div> within event web page for a given event minute item
- event_page: BeautifulSoup
The entire page where the event is found
- Returns:
- matter: Optional[Matter]
Matter if required information could be parsed from minute_section
- get_person(name: str) Person [source]¶
Return matching Person from portland-static.json.
- Parameters:
- name: str
Person full name
- Returns:
- person: Person
Matching Person from portland-static.json
- Raises:
- KeyError
If name does not exist in portland-static.json
References
portland-static.json
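A usage sketch showing the documented KeyError behavior; constructing the scraper with no arguments and the person name are assumptions for illustration:

    from cdp_scrapers.instances.portland import PortlandScraper

    scraper = PortlandScraper()
    try:
        # The name must match an entry in portland-static.json; "Jane Doe" is illustrative
        person = scraper.get_person("Jane Doe")
        print(person.name)
    except KeyError:
        print("Name not found in portland-static.json")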
- get_section_top_number(minute_section: Tag, event_page: BeautifulSoup) str [source]¶
Find the top section number in the minute_section.
- Parameters:
- minute_section: Tag
<div> within event web page for a given event minute item
- event_page: BeautifulSoup
The entire page where the event is found
- Returns:
- doc_number: str
The top section number in the minute_section, with the year appended at the end
- get_sessions(event_page: BeautifulSoup) list[Session] | None [source]¶
Parse meeting video URIs from event_page, return Session for each video found.
- Parameters:
- event_page: BeautifulSoup
Web page for the meeting loaded as a bs4 object
- Returns:
- sessions: Optional[List[Session]]
Session for each video found on event_page
- get_supporting_files(minute_section: Tag) list[SupportingFile] | None [source]¶
Return SupportingFiles for a given EventMinutesItem.
- Parameters:
- minute_section: Tag
<div> within event web page for a given event minute item
- Returns:
- supporting files: Optional[List[SupportingFile]]
See also
Notes
Follow hyperlink to go to minutes item details page. On the details page look for directly-linked files and externally-hosted efiles.
- get_votes(minute_section: Tag) list[Vote] | None [source]¶
Look for ‘Votes:’ in minute_section and create a Vote object for each line.
- Parameters:
- minute_section: Tag
<div> within event web page for a given event minute item
- Returns:
- votes: Optional[List[Vote]]
Votes for corresponding event minute item if found
- class cdp_scrapers.instances.portland.WebPageSoup(status, soup)[source]¶
Bases:
NamedTuple
Create new instance of WebPageSoup(status, soup)
- soup: BeautifulSoup | None¶
Alias for field number 1
- status: bool¶
Alias for field number 0
- cdp_scrapers.instances.portland.disposition_to_minute_decision(disposition: str) EventMinutesItemDecision | None [source]¶
Decide EventMinutesItemDecision constant from event minute item disposition.
- Parameters:
- disposition: str
Disposition from the event web page for a given item, e.g. Passed, Continued
- Returns:
- decision: Optional[EventMinutesItemDecision]
See also
MINUTE_ITEM_PASSED_PATTERNS
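For example, using the dispositions mentioned in the parameter description:

    from cdp_scrapers.instances.portland import disposition_to_minute_decision

    # "Passed" and "Continued" are dispositions cited above
    print(disposition_to_minute_decision("Passed"))
    print(disposition_to_minute_decision("Continued"))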
- cdp_scrapers.instances.portland.get_disposition(minute_section: Tag) str [source]¶
Return disposition string given within minute_section <div> on the event web page.
- Parameters:
- minute_section: Tag
<div> within event web page for a given event minute item
- Returns:
- disposition: str
Disposition string for the event minute item e.g. Accepted, Passed, Placed on file
- cdp_scrapers.instances.portland.get_portland_events(from_dt: datetime | None = None, to_dt: datetime | None = None, **kwargs: Any) list[EventIngestionModel] [source]¶
Public API for use in instances.__init__ so that this func can be attached as an attribute to cdp_scrapers.instances module. Thus the outside world like cdp-backend can get at this by asking for “get_portland_events”.
- Parameters:
- from_dt: datetime, optional
The timespan beginning datetime to query for events after. Default is 2 days from UTC now
- to_dt: datetime, optional
The timespan end datetime to query for events before. Default is UTC now
- kwargs: Any
Any extra keyword arguments to pass to the get_events function.
- Returns:
- events: List[EventIngestionModel]
See also
cdp_scrapers.instances.__init__.py
- cdp_scrapers.instances.portland.load_web_page(url: str | Request) WebPageSoup [source]¶
Load web page at url and return content soupified.
- Parameters:
- url: str | urllib.request.Request
Web page to load
- Returns:
- result: WebPageSoup
WebPageSoup.status = False if web page at url could not be loaded
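A usage sketch checking the returned status flag; the URL is illustrative:

    from cdp_scrapers.instances.portland import load_web_page

    result = load_web_page("https://www.portland.gov/council")  # illustrative URL
    if result.status:
        print(result.soup.title)
    else:
        print("Web page could not be loaded")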
- cdp_scrapers.instances.portland.make_efile_url(efile_page_url: str) str [source]¶
Helper function to get file download link on a Portland EFile hosting web page.
- Parameters:
- efile_page_url: str
URL to Portland efile hosting web page e.g. https://efiles.portlandoregon.gov/record/14803529
- Returns:
- efile url: str
URL to the file itself e.g. https://efiles.portlandoregon.gov/record/14803529/File/Document
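For example, using the efile record URL from the parameter description:

    from cdp_scrapers.instances.portland import make_efile_url

    file_url = make_efile_url("https://efiles.portlandoregon.gov/record/14803529")
    # Expected: https://efiles.portlandoregon.gov/record/14803529/File/Document
    print(file_url)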
cdp_scrapers.instances.seattle module¶
- class cdp_scrapers.instances.seattle.SeattleScraper[source]¶
Bases:
LegistarScraper
Seattle specific implementation of LegistarScraper.
- PYTHON_MUNICIPALITY_SLUG: str = 'seattle'¶
- static dump_static_info(file_path: str) bool [source]¶
Save static data in json format.
- Parameters:
- file_path: str
Static data dump file path
- Returns:
- bool
True if some data was saved in file_path
See also
LegistarScraper.inject_known_data
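A sketch of the intended workflow; the file path is illustrative:

    from cdp_scrapers.instances.seattle import SeattleScraper

    # Save static data to json (file path is illustrative)
    saved = SeattleScraper.dump_static_info("seattle-static.json")
    print("Data saved" if saved else "No data saved")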
- get_content_uris(legistar_ev: dict) list[ContentURIs] [source]¶
Return URLs for videos and captions parsed from seattlechannel.org web page.
- Parameters:
- legistar_ev: Dict
Data for one Legistar Event.
- Returns:
- content_uris: List[ContentURIs]
List of ContentURIs objects for each session found.
See also
Notes
get_events() calls get_content_uris() to get video and caption URIs. get_content_uris() gets the video page URL from EventInSiteURL. If “videoid” is in the video page URL, it calls parse_content_uris(). Else, it calls get_video_page_urls() to get the proper video page URL with “videoid”, then calls parse_content_uris().
- get_events()
-> get_content_uris()
-> parse_content_uris(), or -> get_video_page_urls() then parse_content_uris()
- static get_person_picture_url(person_www: str) str | None [source]¶
Parse person_www and return banner image used on the web page.
- Parameters:
- person_www: str
- Returns:
- Image URL: Optional[str]
Full URL to banner image displayed on person_www
- static get_static_person_info() list[Person] | None [source]¶
Return partial Persons with static long-term information.
- Returns:
- persons: Optional[List[Person]]
- get_video_page_urls(video_list_page_url: str, event_short_date: str) list[str] [source]¶
Return URLs to web pages hosting videos for meetings from event_short_date.
- Parameters:
- video_list_page_url: str
URL to web page listing videos featuring the responsible group/body for the event described in legistar_ev. e.g. http://www.seattlechannel.org/BudgetCommittee?Mode2=Video
- event_short_date: str
the meeting’s date in m/d/yy format
- Returns:
- video_page_urls: List[str]
web page URL per video
See also
- parse_content_uris(video_page_url: str, event_short_date: str) list[ContentURIs] [source]¶
Return URLs for videos and captions parsed from seattlechannel.org web page.
- Parameters:
- video_page_url: str
URL to a web page for a particular meeting video
- event_short_date: str
the meeting’s date in m/d/yy format, used for verification
- Returns:
- content_uris: List[ContentURIs]
List of ContentURIs objects for each session found.
- Raises:
- VideoIdMismatchError
If date on the video web page does not match the event date.
See also
- static roman_to_int(roman: str)[source]¶
Convert a Roman numeral to an integer.
- Parameters:
- roman: str
Roman numeral string
- Returns:
- int
Input roman numeral as integer
References
https://www.w3resource.com/python-exercises/class-exercises/python-class-exercise-2.php
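For example:

    from cdp_scrapers.instances.seattle import SeattleScraper

    # Static method: convert Roman numeral strings to integers
    print(SeattleScraper.roman_to_int("IX"))   # 9
    print(SeattleScraper.roman_to_int("XIV"))  # 14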
Module contents¶
Individual scratchpad and maybe up-to-date CDP instance scrapers.
- cdp_scrapers.instances.get_king_county_events(from_dt: ~datetime.datetime, to_dt: ~datetime.datetime, *, legistar_scraper: ~typing.Type[~cdp_scrapers.legistar_utils.LegistarScraper] = <class 'cdp_scrapers.instances.kingcounty.KingCountyScraper'>, **kwargs: ~typing.Any) List[EventIngestionModel] ¶
- cdp_scrapers.instances.get_seattle_events(from_dt: ~datetime.datetime, to_dt: ~datetime.datetime, *, legistar_scraper: ~typing.Type[~cdp_scrapers.legistar_utils.LegistarScraper] = <class 'cdp_scrapers.instances.seattle.SeattleScraper'>, **kwargs: ~typing.Any) List[EventIngestionModel] ¶
- cdp_scrapers.instances.scraper_get_events(from_dt: ~datetime.datetime, to_dt: ~datetime.datetime, *, legistar_scraper: ~typing.Type[~cdp_scrapers.legistar_utils.LegistarScraper] = <class 'cdp_scrapers.instances.seattle.SeattleScraper'>, **kwargs: ~typing.Any) List[EventIngestionModel] ¶
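A usage sketch for these Legistar-backed convenience functions; the two-day window mirrors the defaults described elsewhere on this page and is illustrative:

    from datetime import datetime, timedelta

    from cdp_scrapers.instances import get_seattle_events

    # Gather Seattle events from an illustrative two-day window ending at UTC now
    to_dt = datetime.utcnow()
    events = get_seattle_events(from_dt=to_dt - timedelta(days=2), to_dt=to_dt)
    print(f"Scraped {len(events)} Seattle events")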