Update metadata and unstructured content extraction #53

Merged (1 commit) on Oct 24, 2023

@csutter csutter commented Oct 24, 2023

This updates the content and metadata extraction to match the new semi-final schema and cleans up the content to be more usefully snippetable.

  • Move some content fields that don't really form part of the primary content to a new `additional_searchable_text` metadata field. We originally put them in the primary indexable content out of convenience and because that is what the existing search does, but moving them keeps the content "cleaner" in case we enable snippeting in the future.
  • Reorder unstructured content fields into a more natural order (again, this would make snippeting more useful).
  • Reorder metadata fields to match the schema (not strictly necessary, as it's JSON, but it makes the output easier to follow).
  • Remove the `public_timestamp_int` field and make the regular `public_timestamp` an integer. There is no reason to keep two fields, as the API can convert the integer back into an ISO timestamp at the point of retrieval.
  • Add additional fields from the semi-final schema (`content_purpose_supergroup`, `part_of_taxonomy_tree`, `locale`).
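
To illustrate the `public_timestamp` change, here is a minimal Python sketch of the round trip between an ISO timestamp and a single integer field. It is not taken from this repository: the helper names `to_epoch_seconds` and `to_iso8601` are hypothetical, and it assumes the integer is Unix epoch seconds in UTC, which the PR description does not specify.

```python
from datetime import datetime, timezone


def to_epoch_seconds(iso_timestamp: str) -> int:
    """Convert an ISO 8601 timestamp into integer Unix epoch seconds.

    Assumption: the integer unit is seconds; the PR does not state it.
    """
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def to_iso8601(epoch_seconds: int) -> str:
    """Convert integer epoch seconds back into an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


# Indexing side: store only the integer in the metadata.
metadata = {"public_timestamp": to_epoch_seconds("2023-10-24T09:39:00+00:00")}

# Retrieval side: rebuild the ISO timestamp from the stored integer.
print(to_iso8601(metadata["public_timestamp"]))
```

Storing only the integer keeps the field sortable and filterable as a number, while the ISO form can be rebuilt whenever a response needs it.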

@csutter csutter merged commit d012157 into main Oct 24, 2023
3 checks passed
@csutter csutter deleted the keywords branch October 24, 2023 09:39