Update metadata and unstructured content extraction #53

csutter · 2023-10-24T09:36:52Z

This updates the content and metadata extraction to match the new semi-final schema and cleans up the content to be more usefully snippetable.

- Move some content fields that don't really form part of the primary content to a new `additional_searchable_text` metadata field. We originally put them in the primary indexable content out of convenience and because that is what the existing search does, but this keeps the content "cleaner" in case we enable snippeting in the future.
- Reorder unstructured content fields into a more natural order (again, this would make snippeting more useful).
- Reorder metadata fields to match the schema (not strictly necessary as it's JSON, of course, but it makes things easier to follow).
- Remove the `public_timestamp_int` field and make the regular `public_timestamp` an integer. There is no reason to have two fields; the API can convert the integer back into an ISO timestamp at the point of retrieval.
- Add additional fields from the semi-final schema (`content_purpose_supergroup`, `part_of_taxonomy_tree`, `locale`).
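
To illustrate the `public_timestamp` change, here is a minimal Python sketch of the round trip between an ISO timestamp and an integer. It is not code from this PR: it assumes the integer is Unix epoch seconds (the description only says "integer"), and the helper names and the shape of the input document are hypothetical.

```python
from datetime import datetime, timezone


def to_epoch_seconds(iso_timestamp: str) -> int:
    """Convert an ISO 8601 timestamp with a UTC offset to integer epoch seconds.

    Assumption: the integer `public_timestamp` is Unix epoch seconds.
    """
    return int(datetime.fromisoformat(iso_timestamp).timestamp())


def to_iso_timestamp(epoch_seconds: int) -> str:
    """Convert integer epoch seconds back to an ISO 8601 timestamp in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def build_metadata(document: dict) -> dict:
    """Build a metadata dict with a single integer `public_timestamp`.

    The input keys are hypothetical; only the output field names come from
    the PR description.
    """
    return {
        "public_timestamp": to_epoch_seconds(document["public_updated_at"]),
        "locale": document.get("locale"),
        "additional_searchable_text": document.get("hidden_search_terms", ""),
    }


metadata = build_metadata(
    {"public_updated_at": "2023-10-24T09:36:52+00:00", "locale": "en"}
)
# At retrieval time, the integer can be turned back into an ISO timestamp.
restored = to_iso_timestamp(metadata["public_timestamp"])
```

Storing only the integer keeps a single source of truth for the timestamp, with the ISO form derived when it is needed.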

csutter merged commit d012157 into main Oct 24, 2023
3 checks passed

csutter deleted the keywords branch October 24, 2023 09:39
