Author(s) | Tim Showers [email protected] |
Implementer(s) | Tim Showers, James Turk |
Status | Draft |
Issue | https://github.com/openstates/enhancement-proposals/issues/TBD |
Draft PR(s) | https://github.com/openstates/enhancement-proposals/pull/TBD |
Approval PR(s) | https://github.com/openstates/enhancement-proposals/pull/TBD |
Created | 2021-11-22 |
Updated | 2021-11-22 |
Scraping bill data in many states requires some form of session-specific metadata. Generally this is a string or integer session ID. To ease creating new sessions and clean up the code, these should be stored in the same top-level session object as other session metadata.
The LegislativeSession data model should be updated to allow an extras dict, matching the behavior of the existing 'extras' fields elsewhere in the data model.
For example, the LegislativeSession entry for the Alabama 2021 Regular Session would change from:
{
    "_scraped_name": "Regular Session 2021",
    "classification": "primary",
    "identifier": "2021rs",
    "name": "2021 Regular Session",
    "start_date": "2021-02-02",
    "end_date": "2021-05-18",
},
to
{
    "_scraped_name": "Regular Session 2021",
    "classification": "primary",
    "identifier": "2021rs",
    "name": "2021 Regular Session",
    "start_date": "2021-02-02",
    "end_date": "2021-05-18",
    "extras": {
        # found in select#current_session at
        # http://alisondb.legislature.state.al.us/alison/SelectSession.aspx
        "session_id": "1076"
    }
},
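A scraper could then read the ID from the session metadata instead of from a hard-coded constant. The following is a minimal sketch, assuming the sessions are available as a plain list of dicts. The helper name get_session_extra is hypothetical and is not an existing openstates-core function.

```python
# Illustrative only: get_session_extra is a hypothetical helper,
# not part of openstates-core.
legislative_sessions = [
    {
        "_scraped_name": "Regular Session 2021",
        "classification": "primary",
        "identifier": "2021rs",
        "name": "2021 Regular Session",
        "start_date": "2021-02-02",
        "end_date": "2021-05-18",
        "extras": {"session_id": "1076"},
    },
]


def get_session_extra(sessions, identifier, key):
    """Return extras[key] for the session with the given identifier.

    Raises KeyError if the session or the key is missing, so a scrape
    fails loudly instead of building broken links.
    """
    for session in sessions:
        if session["identifier"] == identifier:
            extras = session.get("extras", {})
            if key not in extras:
                raise KeyError(f"session {identifier} has no extras key {key}")
            return extras[key]
    raise KeyError(f"unknown session {identifier}")


session_id = get_session_extra(legislative_sessions, "2021rs", "session_id")
```

Raising on a missing key is a design choice in this sketch: it surfaces a forgotten ID when the scrape starts rather than after bad links have been saved.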
Currently we store extra session metadata inconsistently. Sometimes it's a constant dict at the top of various scrapers, sometimes it's included in a jurisdiction-specific common library file, sometimes it's inline in the code as a variable.
This makes creating new sessions more error-prone, as users can't just copy/paste a previous legislative_sessions dict and update its values. It leads to hunting down variables in code rather than having a standardized place. It can also be a source of errors when a scrape completes without that ID but produces broken version or source links. Oftentimes the source of these IDs is not clear to the reader, whereas a standard location would give an obvious place to leave a comment.
This proposal is for an extras dict rather than a simple top-level "session_id" variable, because there could be cases where we need multiple extra variables. This is most likely to happen when a session encodes both a regular and a special session ID in its URLs.
We occasionally also need things like special URL slugs ("/2021ss2/" vs "/2021/"), which we currently handle with if statements that check for special sessions and substitute the slug.
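With the slug stored alongside the session, URL construction no longer needs a per-session conditional. This is a rough sketch under stated assumptions: the key name url_slug, the session identifiers, the session_id values, and the base URL are all placeholders invented for illustration, not values from any real jurisdiction.

```python
# Placeholder base URL, identifiers, and extras values, for illustration only.
BASE_URL = "https://example.com/bills"

# Sessions keyed by identifier; each carries more than one extra value,
# which is the case a single top-level "session_id" could not cover.
sessions = {
    "2021": {"extras": {"session_id": "100", "url_slug": "2021"}},
    "2021s2": {"extras": {"session_id": "102", "url_slug": "2021ss2"}},
}


def bill_list_url(session_identifier):
    """Build the bill listing URL from the slug stored in extras."""
    slug = sessions[session_identifier]["extras"]["url_slug"]
    return f"{BASE_URL}/{slug}/"


regular_url = bill_list_url("2021")
special_url = bill_list_url("2021s2")
```

Adding a new special session then means adding one metadata entry rather than another branch in the scraper.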
The field is named extras rather than something more descriptive to maintain consistency with existing key-value stores in the data model.
We could still end up with inconsistent meta key names, and this doesn't include a formal place for tracking down new IDs.
GovHawk would update the metadata in January or February 2022. A member of the openstates-core group (TBD) would create the necessary Django migrations and update the relevant packages. GovHawk would then modify scrapers to move to the new system as time allows.
This document has been placed in the public domain per the Creative Commons CC0 1.0 Universal license.