Web Archive Search Index API and UI

An API wrapper around the Elasticsearch index of web archival collections, plus a web UI to explore those indexes. It is part of the story-indexer stack but is maintained as a separate repository for future legibility. The repository exposes a FastAPI-based API server and a Streamlit-based search UI (for quick testing). Both are managed as internal services as part of the Media Cloud Online News Archive.

ES Index

The API service expects the following ES index schema. The title and snippet fields must have fielddata enabled if they are of type text. The schema is currently defined in the story-indexer stack and is replicated here for convenience, so this copy might be out of date.

{
    "properties": {
        "original_url": {"type": "keyword"},
        "url": {"type": "keyword"},
        "normalized_url": {"type": "keyword"},
        "canonical_domain": {"type": "keyword"},
        "publication_date": {"type": "date", "ignore_malformed": true},
        "language": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "full_language": {"type": "keyword"},
        "text_extraction": {"type": "keyword"},
        "article_title": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword"}}
        },
        "normalized_article_title": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword"}}
        },
        "text_content": {"type": "text", "fields": {"keyword": {"type": "keyword"}}}
    }
}
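Because the copy of the schema above can drift from the one in story-indexer, a quick check of a live mapping against the expected field types may be useful. The following is a minimal sketch, not part of this repository: `EXPECTED_TYPES` and `find_mapping_mismatches` are hypothetical names, and the function assumes it is given a dict shaped like the JSON above (a `properties` key holding the fields).

```python
from typing import Any

# Expected top-level field types, copied from the schema above.
EXPECTED_TYPES: dict[str, str] = {
    "original_url": "keyword",
    "url": "keyword",
    "normalized_url": "keyword",
    "canonical_domain": "keyword",
    "publication_date": "date",
    "language": "text",
    "full_language": "keyword",
    "text_extraction": "keyword",
    "article_title": "text",
    "normalized_article_title": "text",
    "text_content": "text",
}


def find_mapping_mismatches(mapping: dict[str, Any]) -> dict[str, str]:
    """Return {field: problem} for fields that are missing or have another type."""
    properties = mapping.get("properties", {})
    problems: dict[str, str] = {}
    for field, expected in EXPECTED_TYPES.items():
        if field not in properties:
            problems[field] = "missing"
            continue
        actual = properties[field].get("type")
        if actual != expected:
            problems[field] = f"expected {expected}, found {actual}"
    return problems
```

An empty result means every listed field is present with the expected type. The check ignores sub-fields and extra options such as ignore_malformed.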

Run Services

Configuration is set through environment variables whose names are the upper-case forms of the config parameters. Environment variables that accept a list (e.g., ESHOSTS and INDEXES) can use commas or spaces as separators. For testing, configuration can also be supplied via a config file that follows the syntax of the provided config.yml.sample.
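To illustrate the list handling described above, here is a minimal sketch of how a comma- or space-separated environment variable could be turned into a Python list. The helper names `parse_list` and `read_list_setting` are hypothetical, and the actual parsing in this repository may differ.

```python
import os
import re


def parse_list(value: str) -> list[str]:
    """Split a string on commas and/or whitespace, dropping empty items."""
    return [item for item in re.split(r"[,\s]+", value.strip()) if item]


def read_list_setting(name: str, default: str = "") -> list[str]:
    """Read a list-valued setting from the upper-cased environment variable."""
    return parse_list(os.environ.get(name.upper(), default))
```

With this sketch, `ESHOSTS="http://localhost:9200, http://localhost:9201"` and `ESHOSTS="http://localhost:9200 http://localhost:9201"` yield the same two-item list (the host URLs here are only placeholders).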

Then run the API and UI services using Docker Compose:

$ docker compose up

Once the services are up, the interactive API documentation and the collection index explorer can be accessed in a web browser.

Building and Releasing

Deployments are now configured to be automatically built and released via GitHub Actions.

  1. Change the version number stored in ApiVersion.v1 in api.py
  2. Add a small note to the version history below indicating what changed
  3. Commit and tag the repo with the same number
  4. Push the tag to GitHub to trigger the build and release
  5. Once it is done, the labeled image will be ready at https://hub.docker.com/r/mcsystems/news-search-api

Version History

  • v1.4.2 - Fix topdomains aggregation
  • v1.4.1 - Bugfix correcting missed date conversion in client.py
  • v1.4.0 - New endpoints for sub-aggregations, including a small refactor of how aggregation queries are constructed, and a configurable timeout for Elasticsearch
  • v1.3.9 - Overview query includes 'keyword' field for domain aggregator
  • v1.3.8 - Bugfix for 'expanded' results
  • v1.3.7 - Increased default timeout on top-terms, better tests
  • v1.3.6 - Major refactor and cleanup, behavior unchanged
  • v1.3.5 - Use mc-manage for deployment record
  • v1.3.4 - Add airtable deployment update script
  • v1.3.3 - Remove 'link' header
  • v1.3.2 - Enhancements to GitHub Actions and introduction of an independent deployment script
  • v1.3.1 - Bugfix for 1.3.0
  • v1.3.0 - Change to return aliases as well as indexes as legal values in Collections, and update article endpoint to work in the ILM context
  • v1.2.0 - Change related to ID update in backend ES, including refurbishing the article endpoint and tests
  • v1.1.0 - Change to return None when data is missing (including publication date), update dependencies
  • v1.0.0 - First official release

Tags for dev and staging releases

Append the suffix a for a dev/alpha release and b for a staging/beta release. For example:

  • v1.3.2b - Version 1.3.2 beta release
  • v1.3.2a - Version 1.3.2 alpha release
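The tag convention above can be checked mechanically before pushing. The following is a minimal sketch under stated assumptions: `parse_release_tag`, `ReleaseTag`, and the channel labels (including "production" for a tag with no suffix) are hypothetical and are not taken from this repository's build scripts.

```python
import re
from typing import NamedTuple

# vMAJOR.MINOR.PATCH with an optional single-letter suffix: a (dev) or b (staging).
TAG_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)([ab]?)")

# Channel labels are an assumption for illustration only.
CHANNELS = {"": "production", "a": "dev", "b": "staging"}


class ReleaseTag(NamedTuple):
    version: str
    channel: str


def parse_release_tag(tag: str) -> ReleaseTag:
    """Split a tag such as 'v1.3.2b' into its version and release channel."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a valid release tag: {tag!r}")
    major, minor, patch, suffix = match.groups()
    return ReleaseTag(f"{major}.{minor}.{patch}", CHANNELS[suffix])
```

The returned version string can then be compared with the version number set in api.py, so that a mismatched tag is caught before the build is triggered.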