Design Documentation

System

diagram

STATUS: FOR DISCUSSION

When the scraper has finished running:

For any new Ariba record, or any record whose ZIP file size differs from the size recorded the last time the script ran:
- The Ariba files are downloaded and saved (unzipped) to Google Drive in a folder following the format specified below
- The Ariba page is saved as an HTML file in the same folder
- The XML fragment from the OCDS data is saved in the same folder
- Data is sent to the DB API (INSERT OR UPDATE IF EXISTS): OCDS data, list of public URLs for file attachments, and parsed text from each file
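
The new/changed check described above could be sketched as follows. This is a minimal sketch, assuming ZIP sizes are tracked as a mapping from record UID to size in bytes; the function name `classify_records` and both dictionaries are hypothetical and not part of this design.

```python
def classify_records(previous_sizes, current_sizes):
    """Split the current Ariba records into new and changed ones.

    Both arguments map a record UID to its ZIP file size in bytes
    (an assumed representation, not specified in this document).
    """
    new, changed = [], []
    for uid, size in current_sizes.items():
        if uid not in previous_sizes:
            # Never seen before: treat as a new record.
            new.append(uid)
        elif previous_sizes[uid] != size:
            # Seen before, but the ZIP size differs from last run.
            changed.append(uid)
    return sorted(new), sorted(changed)
```

Records in either list would then go through the download, save, and DB API steps listed above; unchanged records are skipped.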
For any new OCDS records that haven't already been sent to the DB API, the data is sent to the DB API
For any OCDS records that don't match the data we have in our DB, the data is sent to the DB API
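
The two OCDS rules above reduce to one comparison: send a record if the DB has no copy of it, or if the stored copy differs. A minimal sketch, assuming both sides are available as dictionaries keyed by record UID (how the stored copies are obtained from the DB API is not specified here, and `ocds_records_to_send` is a hypothetical name):

```python
def ocds_records_to_send(scraped, stored):
    """Return the UIDs of OCDS records that should be sent to the DB API.

    scraped: UID -> record as scraped in this run (assumed shape)
    stored:  UID -> record as currently held in the DB (assumed shape)
    """
    # stored.get(uid) is None for unknown UIDs, so new records and
    # records whose content differs are both selected.
    return sorted(uid for uid, record in scraped.items()
                  if stored.get(uid) != record)
```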
A single CSV(?) file containing the ZIP file size for each Ariba record is updated on Google Drive
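
If CSV is chosen, the size file could be as simple as one row per record. The sketch below is only an illustration: the column names `uid` and `zip_size_bytes` are assumptions, and uploading the file to Google Drive is left out.

```python
import csv
import io

# Hypothetical column names; the file layout is not decided in this document.
FIELDNAMES = ["uid", "zip_size_bytes"]


def dump_sizes(sizes, fh):
    """Write a UID -> ZIP size mapping to an open text file as CSV."""
    writer = csv.DictWriter(fh, fieldnames=FIELDNAMES)
    writer.writeheader()
    for uid in sorted(sizes):
        writer.writerow({"uid": uid, "zip_size_bytes": sizes[uid]})


def load_sizes(fh):
    """Read the CSV back into a UID -> ZIP size (int) mapping."""
    return {row["uid"]: int(row["zip_size_bytes"])
            for row in csv.DictReader(fh)}


# Round trip through an in-memory buffer instead of a real file.
buf = io.StringIO()
dump_sizes({"B": 20, "A": 10}, buf)
buf.seek(0)
restored = load_sizes(buf)
```

Sorting by UID keeps the file stable between runs, which makes differences easier to spot by eye.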
A log is output to a new Slack channel, including:
- Date/time, "ID" of host running the scraper
- Any errors encountered
  - Including records found in Ariba but not in OCDS, or vice versa
- New records added (UID)
- Records/files updated (UID and old + new ZIP file sizes)
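
As an illustration of the log contents listed above, the message text could be assembled like this. It is a sketch under assumptions: the function name, argument shapes, and line layout are all hypothetical, and posting the text to Slack is not shown.

```python
from datetime import datetime, timezone


def build_log_message(host_id, run_time, errors, new_uids, updated):
    """Build the plain-text run log.

    errors:   list of error strings (including Ariba/OCDS mismatches)
    new_uids: list of UIDs added in this run
    updated:  list of (uid, old_zip_size, new_zip_size) tuples
    """
    lines = [f"{run_time.isoformat()} host={host_id}"]
    lines.append(f"Errors ({len(errors)}):")
    lines.extend(f"- {error}" for error in errors)
    lines.append(f"New records ({len(new_uids)}):")
    lines.extend(f"- {uid}" for uid in new_uids)
    lines.append(f"Updated records ({len(updated)}):")
    lines.extend(f"- {uid}: {old} -> {new} bytes" for uid, old, new in updated)
    return "\n".join(lines)


message = build_log_message(
    "host-1",
    datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    ["X found in Ariba but not in OCDS"],
    ["A"],
    [("B", 20, 25)],
)
```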