
Workflow

Esmé Cowles edited this page Jun 6, 2022 · 6 revisions


This document describes an imagined workflow for recovering hidden names in archival resources.

graph LR
  A[Identify Need] --> B[Perform Preliminary Analysis]
  B --> C[Define Corpus]
  C --> D[Perform Initial Digitization*]
  D --> E[Perform OCR]
  E --> F[Perform Name Identification]
  F --> G[Link Appellations to Entities]
  G --> H[Perform Corpus Analysis]
  • Identify a need. Someone (an archivist; a researcher; an auditor) discovers or suspects bias in a finding aid.
  • Perform a preliminary analysis. An archivist reviews the collection and the existing finding aid.
    • What information could an archivist/reviewer gather at this stage that would be helpful in the automated phases of the processing?
      • Languages.
      • Media Types. Typed pages? Manuscripts? Clippings? Mimeographs? Photocopies?
  • Define the corpus. Based on the preliminary analysis, create an inventory of archival resources: the collections, series, sub-collections, and containers that should be processed.
  • Perform Initial Digitization, if necessary. If the materials have not already been photographed and ingested into Figgy, do so. The end result of this step is a set of IIIF Manifests representing the units to be processed.
  • Perform OCR. Figgy runs OCR on ingested images automatically (if they have the `OCR Language` property set), but specialized OCR, based on the preliminary analysis, may produce significantly better results. Furthermore, performing OCR as part of this workflow makes it possible to assign confidence scores to each page, which in turn enables later steps to ignore pages that don’t have (much) recognizable text.
  • Perform Name Identification. NB: this step is often called Named-Entity Recognition, or NER, but we are being careful with our terminology to avoid confusion. For our purposes, the process of identifying names in the corpus entails two steps:
    • Perform NLP. For each unit (i.e., for each IIIF Manifest), execute a typical natural-language-processing pipeline: tokenization, tagging, parsing, and detecting and labeling tokens as names. The output of this step is a graph of inscriptions (OCR text) of interest, linked to the IIIF canvases on which they appear.
    • Edit Identified Names. The set of inscriptions must be cleaned by hand in order to associate them with Appellations. This work entails examining ambiguous, misspelled, and unrecognized name strings and mapping them to one or more Appellations.
  • Link Appellations to Entities. This step is often called Named-Entity Linking. Where possible, link Appellations to existing authorized names (from VIAF, SNAC); otherwise, mint a new, local entity record. Incorporate these entities into the Knowledge Graph.
  • Perform Corpus Analysis. Run programs that use the Knowledge Graph to determine the frequency of an Entity’s occurrence in the corpus, plus other information that aids archivists in determining the relevance of each Entity to their goals.
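
As a rough illustration of how the per-page OCR confidence scores could feed the final corpus analysis, here is a minimal Python sketch. It is not taken from the project's code: the `Page` record, its field names, the `0.6` threshold, and the identifiers in the sample data are all hypothetical, and a plain list of pages stands in for the Knowledge Graph.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical record for one page (one IIIF canvas) of the corpus."""
    canvas_id: str          # assumed identifier of the IIIF canvas
    ocr_confidence: float   # assumed page-level OCR confidence, 0.0 to 1.0
    entity_ids: list[str] = field(default_factory=list)  # entities linked on this page


def usable_pages(pages, min_confidence=0.6):
    """Keep only pages whose OCR confidence meets the (assumed) threshold."""
    return [p for p in pages if p.ocr_confidence >= min_confidence]


def entity_frequencies(pages):
    """Count how many times each entity occurs across the given pages."""
    counts = Counter()
    for page in pages:
        counts.update(page.entity_ids)
    return counts


# Made-up sample data: the second page has little recognizable text.
pages = [
    Page("canvas/1", 0.92, ["entity/a", "entity/b"]),
    Page("canvas/2", 0.15, ["entity/a"]),
    Page("canvas/3", 0.81, ["entity/a"]),
]

frequencies = entity_frequencies(usable_pages(pages))
for entity_id, count in frequencies.most_common():
    print(entity_id, count)
```

Filtering before counting means that names "recognized" in near-illegible pages do not inflate an Entity's frequency. The right threshold would have to be tuned against the corpus.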