Json-to-warc utility and WARC documentation #336

Open
pgulley opened this issue Oct 9, 2024 · 0 comments
Labels: documentation, enhancement

Comments

pgulley (Member) commented Oct 9, 2024

Occasionally we will want to ingest stories provided to us by third parties, who will probably find it easiest to offer that data as JSON data dumps.

We've chosen WARCs as a kind of catch-all format and already have ingestion architecture built around them. We should have a standard process, probably just a single well-instrumented script, that can take a JSON dump in some format and produce a WARC archive that we can then ingest via our standard pipeline.

Documentation around this, to give to third parties as they produce data dumps, would be nice: just a description of the schema we expect and some explanation of the rationale behind WARCs.
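
As a rough starting point, the conversion script could look something like the sketch below. Everything specific in it is an assumption rather than a decided design: the input is taken to be JSON Lines with one story per line, the field names `url` and `html` are a hypothetical schema, and the function names are made up for illustration. It uses only the Python standard library and writes each story as a WARC/1.0 `response` record wrapping a synthesized HTTP response, gzipped one record at a time; a real script would more likely build on an established WARC library and add logging and validation.

```python
import gzip
import json
import uuid
from datetime import datetime, timezone


def story_to_warc_record(story: dict) -> bytes:
    """Build one uncompressed WARC/1.0 'response' record from a story dict.

    Assumed (hypothetical) schema: {"url": str, "html": str}.
    """
    body = story["html"].encode("utf-8")
    # Synthesize a minimal HTTP response that wraps the story HTML.
    http_block = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html; charset=utf-8\r\n"
        + f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
        + body
    )
    # WARC-Date here is the conversion time, since the assumed schema
    # carries no fetch timestamp.
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {story['url']}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_block)}",
    ]
    header_bytes = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record ends with two CRLFs after the content block.
    return header_bytes + http_block + b"\r\n\r\n"


def json_lines_to_warc(json_path: str, warc_path: str) -> int:
    """Convert a JSON Lines dump to a .warc.gz file; return the record count."""
    count = 0
    with open(json_path, encoding="utf-8") as src, open(warc_path, "wb") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            story = json.loads(line)
            # One gzip member per record, concatenated into a single file.
            dst.write(gzip.compress(story_to_warc_record(story)))
            count += 1
    return count
```

With a dump in that assumed shape, `json_lines_to_warc("stories.jsonl", "stories.warc.gz")` (file names are placeholders) would produce an archive to hand to the existing pipeline.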

pgulley added the documentation and enhancement labels Oct 9, 2024