Establish a data control plan #8
Though we hope that Wikipedia will take this under their wing sometime, we should not assume that they will. Based on that, we're setting up a community-based model for managing the generation of snapshots from kiwix dumps. This is one of the first tests of the model that evolved out of the Data Rescue hackathons in early 2017, where communities of hackers, content specialists, and do-gooders work together to manage the work of pulling data off of centralized servers and redistributing it. To apply this model we're partnering with @b5 from http://www.qri.io/, who did a lot of the technical work behind the Data Rescue hackathons. Many other people, like @dcwalk, @titaniumbones, @mayaad, @trinberg, and @abergman, contributed to the evolution of this model.

The Process

Key elements of this process:
Balancing Open Community with Careful Chain of Custody

It may seem like the open community model is at odds with maintaining a clear chain of custody when processing the snapshots. Here's how we will balance the two:

- Open community contributions (via github Pull Requests, etc.) wherever possible.
- Meanwhile, a smaller group of committers will handle the custody-sensitive steps of actually running the scripts and publishing the resulting snapshots.
Eventually we might incorporate cryptographic techniques (i.e. SNARKs) to prove that the intended operations (and only the intended operations) were run on the snapshots, which would allow anyone to build the snapshots without corrupting the chain of custody. This will require some research. For now, it's overkill.
Note: one cool thing about using IPFS with this structure is that if you want to validate that someone actually ran the scripts they claim, you can just re-run the scripts from the same sources and compare the hashes of the results...
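A minimal sketch of that check, assuming a local IPFS daemon on the default API port and the go-ipfs-api client; the snapshot output directory and the published hash are hypothetical placeholders, not anything this project has published:

```go
// Sketch: verify a committer's published snapshot hash by re-running the
// build ourselves and comparing IPFS content hashes. Identical inputs run
// through identical scripts should produce an identical hash.
package main

import (
	"fmt"
	"log"

	shell "github.com/ipfs/go-ipfs-api"
)

func main() {
	publishedHash := "Qm..." // hash claimed by the committer (placeholder)

	// Connect to a local IPFS daemon's API.
	sh := shell.NewShell("localhost:5001")

	// After re-running the processing scripts from the same sources,
	// add the output directory to IPFS to get its content hash.
	ourHash, err := sh.AddDir("./snapshot-output")
	if err != nil {
		log.Fatal(err)
	}

	if ourHash == publishedHash {
		fmt.Println("hashes match: the claimed scripts reproduce the snapshot")
	} else {
		fmt.Printf("mismatch: got %s, expected %s\n", ourHash, publishedHash)
	}
}
```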
pinging @patcon and
Ok, we've started to make progress on this. Currently it just defaults to sending emails while we figure out how to connect the requests to a queue, but it's a start. Live url here: https://task-mgmt.archivers.space (note, you'll need write access to this repo to log in).

I've outlined some next steps in the repo readme. @flyingzumwalt, it might make sense to touch base on next steps sometime soon, specifically around the question of where the actual task execution is going to happen. If we need to build that, that's ok. In the meantime I still have lots to chew on.
Many platforms allow for public access first, with a separate upgrade to private repo access when the need arises.
Is archivers requesting access? I thought it was just using the GH oauth response to know if the user has write access to this repo -- so you need write permission in the GH repo in order to manage stuff in task-mgmt.
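For illustration, a check like that can use GitHub's collaborator-permission endpoint after the OAuth handshake. A minimal sketch using plain net/http; the repo name, username, and token are placeholders, not the app's actual code:

```go
// Sketch: ask the GitHub REST API whether a user has write access to a
// repo, using the token obtained from the OAuth login flow.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// permissionResponse mirrors the relevant field of GitHub's
// GET /repos/{owner}/{repo}/collaborators/{user}/permission response.
type permissionResponse struct {
	Permission string `json:"permission"` // "admin", "write", "read", or "none"
}

func hasWriteAccess(token, owner, repo, user string) (bool, error) {
	url := fmt.Sprintf("https://api.github.com/repos/%s/%s/collaborators/%s/permission",
		owner, repo, user)
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Authorization", "token "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var p permissionResponse
	if err := json.NewDecoder(resp.Body).Decode(&p); err != nil {
		return false, err
	}
	return p.Permission == "admin" || p.Permission == "write", nil
}

func main() {
	// Hypothetical values; the token comes from the GH OAuth response.
	ok, err := hasWriteAccess("<oauth-token>", "datatogether", "task_mgmt", "someuser")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("write access:", ok)
}
```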
The management page does: if you try to log in with GH at https://task-mgmt.archivers.space, it asks for private repo access.
aha. yeah we have to change that.
Oh yes, completely agreed. I'll drop the permissions ask and will report back once the change is up.
Ok, change is now live. App shouldn't request access to private repos.
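For reference, narrowing the ask like this usually comes down to requesting a smaller OAuth scope. A minimal sketch, assuming the golang.org/x/oauth2 package; the client credentials and callback URL are placeholders, not the app's real config:

```go
// Sketch: a GitHub OAuth config that requests only public-repo access.
package main

import (
	"fmt"

	"golang.org/x/oauth2"
	githuboauth "golang.org/x/oauth2/github"
)

func main() {
	conf := &oauth2.Config{
		ClientID:     "<client-id>",
		ClientSecret: "<client-secret>",
		RedirectURL:  "https://task-mgmt.archivers.space/oauth/callback",
		// "public_repo" covers public repos only; the broader "repo"
		// scope (which includes private repos) is what gets dropped.
		Scopes:   []string{"public_repo"},
		Endpoint: githuboauth.Endpoint,
	}

	// Send the user here to authorize; GitHub's consent screen will
	// show only the public-repo permission.
	fmt.Println(conf.AuthCodeURL("state-token"))
}
```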
Update: @b5 is making amazing progress building a robust and reusable solution for our data-control needs: datatogether/task_mgmt#4
we should outline the control plan (hand over to Wikipedia itself, etc.)