
Design Documentation

Gabe Sawhney edited this page Apr 8, 2023 · 9 revisions

System

diagram

Scraper

STATUS: FOR DISCUSSION

When the scraper has finished running:

  • For any new Ariba records, or any records whose ZIP file size differs from the size recorded the last time the script ran:
    • Ariba files (unzipped) are downloaded and saved to Google Drive in a folder following the format specified below
    • The Ariba page is saved as an HTML file to the same folder
    • The XML fragment from the OCDS data is saved to the same folder
    • Data is sent to the DB API (INSERT OR UPDATE IF EXISTS): OCDS data, list of public URLs for file attachments, and parsed text from each file
  • For any new OCDS records which haven't already been sent to the DB API: send the data to the DB API
  • For any OCDS records which don't match the data we have in our DB: send the data to the DB API
  • A single CSV(?) file containing the ZIP file size for each Ariba record is updated on Google Drive
  • Output a log to a new Slack channel, including:
    • Date/time, "ID" of host running the scraper
    • Any errors encountered
      • Including records found in Ariba and not OCDS, or vice versa
    • New records added (UID)
    • Records/files updated (UID and old + new ZIP file sizes)
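
As a starting point for discussion, the change detection and log summary described above could be sketched roughly as follows. This is only a sketch under assumptions: the size file is taken to be a CSV with `uid` and `zip_size` columns, and every function name, column name, and log line format here is hypothetical rather than something already agreed. Downloading from Ariba, writing to Google Drive, calling the DB API, and posting to Slack are all left out.

```python
import csv
import io
from dataclasses import dataclass, field


def load_sizes(csv_text):
    """Parse the size file (assumed columns: uid, zip_size) into {uid: bytes}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["uid"]: int(row["zip_size"]) for row in reader}


def dump_sizes(sizes):
    """Serialize {uid: bytes} back to CSV text, sorted by UID for stable diffs."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["uid", "zip_size"])
    for uid in sorted(sizes):
        writer.writerow([uid, sizes[uid]])
    return out.getvalue()


@dataclass
class Diff:
    new: list = field(default_factory=list)      # UIDs not seen on the last run
    updated: list = field(default_factory=list)  # (uid, old_size, new_size)


def diff_sizes(previous, current):
    """Compare ZIP sizes from the last run against sizes seen on this run."""
    diff = Diff()
    for uid, size in sorted(current.items()):
        if uid not in previous:
            diff.new.append(uid)
        elif previous[uid] != size:
            diff.updated.append((uid, previous[uid], size))
    return diff


def mismatched_uids(ariba_uids, ocds_uids):
    """Return (only in Ariba, only in OCDS) as sorted lists of UIDs."""
    ariba, ocds = set(ariba_uids), set(ocds_uids)
    return sorted(ariba - ocds), sorted(ocds - ariba)


def format_log(host_id, timestamp, diff, only_ariba, only_ocds):
    """Build the plain-text run summary; the line format is illustrative only."""
    lines = [f"{timestamp} host={host_id}"]
    for uid in only_ariba:
        lines.append(f"ERROR {uid}: found in Ariba but not in OCDS")
    for uid in only_ocds:
        lines.append(f"ERROR {uid}: found in OCDS but not in Ariba")
    for uid in diff.new:
        lines.append(f"NEW {uid}")
    for uid, old, new in diff.updated:
        lines.append(f"UPDATED {uid}: ZIP size {old} -> {new}")
    return "\n".join(lines)
```

A run would load the previous sizes with `load_sizes`, pass them with the sizes observed on this run to `diff_sizes`, process only the UIDs in `new` and `updated`, then write the result of `dump_sizes` back to Google Drive and send the output of `format_log` to the log channel. Comparing sizes is cheap but would miss an edit that leaves the ZIP size unchanged; whether that risk is acceptable is open for discussion.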