```go
Fail *time.Time `json:"fail"`
// url to where the code to execute lives
// example: https://github.com/ipfs/ipfs-wiki/mirror
RepoUrl string `json:"repoCommit"`
```
json:"repoUrl"
Exciting! It would be great to have a code snippet that shows how you would configure a pipeline and run it. For example, what would I need to do to set up a process to, say, mirror a repository's contents onto IPFS?
Note: in a real-world scenario, you also need to figure out strategies for pinning, unpinning, and garbage-collecting ipfs content from these processes -- you need to keep it pinned long enough for people to replicate the results onto the destination machines, but you don't want all the content accumulating on servers that are set up for ephemeral process runs.
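To make that pin/unpin lifecycle concrete, here's a minimal sketch using the go-ipfs-api client; the daemon address and CID below are placeholders, not anything from this PR:

```go
package main

import (
	"fmt"
	"log"

	shell "github.com/ipfs/go-ipfs-api"
)

func main() {
	// talk to a local ipfs daemon's API (address is a placeholder)
	sh := shell.NewShell("localhost:5001")

	// hypothetical hash of a process run's output
	cid := "QmExampleHashOfProcessOutput"

	// pin, so the content survives gc long enough for others to replicate it
	if err := sh.Pin(cid); err != nil {
		log.Fatal(err)
	}
	fmt.Println("pinned", cid)

	// ...later, once replication to destination machines is confirmed,
	// unpin so `ipfs repo gc` can reclaim space on the ephemeral server
	if err := sh.Unpin(cid); err != nil {
		log.Fatal(err)
	}
	fmt.Println("unpinned", cid)
}
```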
It seems we are currently setting up a pretty similar infrastructure for creating ZIM files :(
@kelson42 the pattern does look very similar! Wonderful. zimfarm looks like a task-management tool specifically for ZIM files; datatogether is aimed at establishing a pattern for any community to replicate, manage, annotate, and process any data, using decentralized storage patterns. This datatogether/task-mgmt repo provides some of the tooling to support that pattern. It will be great if we can cross-pollinate between the two projects. There are lots of motivations for using task-mgmt with all sorts of other data that have nothing to do with wikipedia, but the two main motivations for using task-mgmt with wikipedia ZIM dumps are:
Will it be possible to do those two things with zimfarm?
The worker part of zimfarm is based on Docker. A job/task is basically a:
So I tend to say yes. It might really make sense to share the whole scheduling part of the solution... then everybody can build their own Docker images and jobs to do whatever they want.
@b5 -- sounds neat, still wrapping my head around the task types; could you unpack that? I guess I'm trying to imagine the range of actions (sorry, wrong vocab) subsumed within a task, and looking at the code I'm positive I'm not parsing marshalling/taskable/task correctly.
```
## From a recent Pull Request:

_TODO - make a proper readme, this is ripped from a recent Pull Request:_
```
Pulling in the query response, or even just the testing data, would be helpful! Also, I'm still trying to think through how to make clear what tasks are defined.
Loud & clear on the example, I'll work on documenting one. In the context of this PR accomplishing this task would amount to submitting a pull request to this repo that explicitly lists the task, which users can then initiate from datatogether.org (so long as they have the right permissions). In the future it may be possible for users to compose disparate tasks into chains-of-tasks from the frontend, but that sounds complicated.
Yes. I'd love to chat more about this one. My initial thought was to store the data-intensive results of these tasks in some s3-like thing & mount it as the volume ephemeral IPFS nodes read from, but I'd like to learn more about IPFS cluster, and think together about long-term planning of this infra. Especially as it relates to the un-finalized thing that member institutions & users download to participate in holding "data together data".
Apologies, that's very vague phrasing, mainly b/c it's unfinished work. I do mean the bit in question.
What a task could be is intentionally vague. I'm currently thinking about tasks as repeatable actions that transition content to the distributed web. This includes moving things onto IPFS, but also everything from the world of metadata, and the list of different task types from above. Any of these tasks can take an arbitrary amount of time, which is why we want to queue them up. The task/taskable naming is, well, awful.

@kelson42 Great to see you're doing similar stuff! If there are things we could do to make this service more useful for your needs, I'd love to hear about it! I'll also keep an eye on zimfarm & see if we can't use your code once we find our legs!
Stepping back for a moment, a few things jump to mind when I read "task execution system":
Unrelated to the above, could you also unpack "crawling WARC files onto IPFS"?
@mhucka -- great points! Agree strongly with 2 & 4 :)
Hey all, @flyingzumwalt asked if I had some input too. Strongly agree that there's very likely some existing software that matches the requirements, and helps avoid reinventing the wheel. Current CI (continuous integration) systems like Jenkins might also be worth a look. They come with:
(hi @dcwalk o/ we met through toronto meshnet a few times)
I agree with the inclination to avoid reinventing wheels. The key here, with respect to datatogether, is that we want to encourage ongoing experimentation in this domain. This experimentation should be permissionless (anyone can cook up a new solution and share it with the world), and loosely coordinated (if you have an idea, you share it with collaborators and try to build on existing momentum where possible). Right now there are at least two interesting experiments within the data rescue domain:
The most compelling aspect of the work in this current PR is the pattern of using Pull Requests (on github) as a point of quality control and security review before tasks get modified. This allows us to rely on the existing transparency of git Pull Requests, and on community patterns around github PRs, to ensure that the code (and docker containers, etc) we use is safe, repeatable, and maintained in a transparent fashion. I think this is a very compelling pattern to explore.

It's definitely worth considering DAG-based workflow automation tools like Airflow, Celery, etc. Jenkins is also a good option to consider for the mid-to-long term. If we adopt tools like that, the main thing to carry over from the current proof of concept is this quality-control-via-PRs pattern.

In the meantime, let's merge this PR. I don't want long-term considerations to prevent us from landing a proof of concept. Instead we should ship the proof and use it to spur conversation about what should come next. Previously this code base (which was running as alpha.archivers.space) relied on administrators, aka @b5 and team, to either manually run tasks or set up cron jobs on a server. This PR is a great improvement over that. What this PR does:
If nobody objects, I will merge this PR tomorrow.

I should also spin off some GH issues to follow up on the ideas that people surfaced in this thread, so they don't get lost when we close the PR.
I agree with @flyingzumwalt. In retrospect, I think my comments about existing workflow systems should have gone elsewhere (maybe a separate issue) as they are a bit of a derail w.r.t. this particular PR :-). Sorry about that.

LOL while I was writing my comments, the last comment by @flyingzumwalt popped up just as I clicked on the green button. Talk about timing.

I'm willing to start a new issue for this comment of mine, but am unsure how best to copy the content from @flyingzumwalt and @lgierth's follow-ups into the new issue. Do people have preferences or suggestions?
I'd just create an issue around something like "build on existing task management tools" and mark it with the "enhancement" label. As far as capturing comments from me and @lgierth, you can either quote and cite (example: ipfs/in-web-browsers#7) or just cc us and let us add our own comments to the thread.
OK, you most excellent people, I created an issue per our discussion upthread, and opted to use the quote-and-cite approach because it seemed the most likely to fully document how we got there.

rejoining a little late: hey again o/ @lgierth :))
This is an overhaul of task-mgmt, transitioning it from a one-off proof of concept to an extensible backend service.
This service started as a standalone example of a single workflow:
So, uh, it didn't do much, but it did set the stage for authenticated task-initiation, which remains a big area in need of development.
This PR changes task-mgmt into a service oriented around tasks on a queue, and introduces an interface for writing new kinds of tasks, extending the capabilities of data together over time. Since starting to work with this task-oriented pattern, I've come to believe that much of the work we've been doing in gov archiving consists of large, human versions of this pattern, and this gives us a way of expressing those tasks in code, which makes for very exciting potential.
So, breaking the concepts in this PR down:
Tasks are any kind of work that needs to get done, but specifically work that would take longer than, say, a web request/response cycle should take. An example of a task could be "put this url on IPFS"; another might be "identify the filetype of these 30,000 files, putting the results in a database".
Because this work will take anywhere from milliseconds to days, and may require special things to do it, it makes sense to put those tasks in a queue: a giant, rolling list of things to do, where different services can add tasks and sign up to do tasks as they hit the queue. This PR introduces a first-in-first-out (FIFO) queue to start with, meaning the first thing added to the list is the first thing pulled off it.
The queue itself is a server, specifically a rabbitmq server. It's open source, and based on the open AMQP protocol. This means that things that work with the queue don't necessarily need to be written in go. More on that in the future.
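For a concrete feel, here's a minimal sketch of adding a task to such a queue and pulling tasks off it, using the streadway/amqp go client; the queue name and message body are assumptions for illustration, not from this PR:

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	// connect to the rabbitmq server (address is a placeholder)
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// declare a durable queue; the name "tasks" is illustrative
	q, err := ch.QueueDeclare("tasks", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	// add a task to the queue...
	body := []byte(`{"type":"ipfs_add","url":"https://example.com/data.warc"}`)
	err = ch.Publish("", q.Name, false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        body,
	})
	if err != nil {
		log.Fatal(err)
	}

	// ...and any service, in any language, can sign up to receive tasks
	msgs, err := ch.Consume(q.Name, "", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}
	for m := range msgs {
		log.Printf("got task: %s", m.Body)
	}
}
```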
The task-mgmt service does just what it says on the tin. Its main job is to manage not just tasks, but the state of tasks as they move through the queue; questions like "what tasks are currently running?" are handled with this PR. As tasks are completed, task-mgmt updates records of when tasks started, stopped, etc.
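As a sketch, the record task-mgmt keeps per task might look something like this; `Fail` and `RepoUrl` appear in this PR's diff, while the other fields are assumptions for illustration:

```go
package tasks

import "time"

// Task sketches the record task-mgmt might keep as work moves through
// the queue. Fail & RepoUrl appear in this PR's diff; the other fields
// are assumptions for illustration.
type Task struct {
	Id        string     `json:"id"`
	Title     string     `json:"title"`
	Enqueued  *time.Time `json:"enqueued"` // when the task hit the queue
	Started   *time.Time `json:"started"`  // when a worker picked it up
	Succeeded *time.Time `json:"succeeded"`
	Fail      *time.Time `json:"fail"`
	// url to where the code to execute lives
	// example: https://github.com/ipfs/ipfs-wiki/mirror
	RepoUrl string `json:"repoUrl"`
}
```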
This PR removes all user interfaces and instead introduces both a JSON API and a remote procedure call (RPC) API. The RPC API will be used to fold all of task-mgmt into the greater datatogether api. I know, that's the word api a million times; basically this means we'll have a PR on the datatogether/api to expose tasks, so that outside users will access tasks the same way they access, say, coverage, or users. Only internal services will need to use the task-mgmt JSON API as a way of talking across languages.
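A minimal sketch of what exposing tasks over go's standard net/rpc could look like; the receiver and method names are hypothetical, not task-mgmt's actual API:

```go
package main

import (
	"log"
	"net"
	"net/rpc"
)

// TaskRequests is a hypothetical RPC receiver; method names here are
// illustrative, not task-mgmt's actual API.
type TaskRequests struct{}

type ListParams struct {
	Limit  int
	Offset int
}

// List satisfies net/rpc's method shape: exported, two args
// (params + pointer reply), returning an error.
func (TaskRequests) List(p ListParams, reply *[]string) error {
	// a real implementation would read task records from the database
	*reply = []string{"example: put this url on IPFS"}
	return nil
}

func main() {
	if err := rpc.Register(TaskRequests{}); err != nil {
		log.Fatal(err)
	}
	// port is a placeholder; the datatogether api would dial this
	// address and proxy task requests from outside users
	l, err := net.Listen("tcp", ":4400")
	if err != nil {
		log.Fatal(err)
	}
	rpc.Accept(l)
}
```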
All of these changes turn the task-mgmt server into a backend service, so that we can fold all task stuff into the existing frontend. This means once the UI is written you'll be able to view, create, and track the progress of tasks from the standard webapp. PR on datatogether/context to follow.

Along with tracking tasks, task-mgmt both adds to and reads from the queue. This might seem strange, but it makes for a simpler starting point. Later on down the road lots of different services may be registered to accept tasks from the queue, at which point we can transition task-mgmt to the role of just adding to the queue and tracking progress.
But most importantly of all, this PR also introduces a new subpackage, task-mgmt/tasks, which outlines the initial interface for a task definition. This is the platform on which tasks can be extended to all sorts of things. Getting this interface right is going to take some time, so I'd like to write an initial round of task-types, and then re-evaluate the interface. Those initial task-types are:

This list of task-types is aimed at the high-priority needs from the community. Combining the first 4 task-types with the soon-to-land collections feature gives us everything we need to satisfy some latent EDGI needs (what up @titaniumbones), morph.io runs connect us to the work team boston has been up to, and the rest are for the Protocol Labs Wikipedia-on-IPFS project (what up @flyingzumwalt). I'm hoping to land all these in a series of PRs in the next 10 days. Once those are landed we'll have to put some sort of permissions-based barriers in place to dictate who is allowed to create what sorts of tasks; that will be a job for a different service.
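To make the task-definition idea concrete, here's a minimal sketch of the kind of interface task-mgmt/tasks might outline; the names and signatures are assumptions, not the actual interface:

```go
package tasks

// Taskable sketches what a task definition might need to satisfy.
// Names & signatures are assumptions; the real interface in
// task-mgmt/tasks may differ.
type Taskable interface {
	// Valid checks a task's configuration before it's enqueued
	Valid() error
	// Do performs the work, streaming progress updates until
	// the task either succeeds or fails
	Do(updates chan Progress)
}

// Progress is an assumed progress-report type
type Progress struct {
	Percent float32 // 0.0 to 1.0
	Done    bool
	Error   error
}
```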
The next round of task-types can/might include:
From a programmer-participation perspective, we can heavily document how defining a task works, and this will provide a great way for devs to extend the datatogether platform to do things we haven't yet thought of. Lots of ideas for new task-types come up in places like the ipfs/archives repo.

I'd like to get this merged in order to get working on surfacing tasks within the webapp & API, but discussing the merits of this approach / potential alternatives is in no way off the table. Also, feel free to drop questions, as I'll work them into the readme!