How do we add "pattern requests" for scraping functionality? #5
@dcwalk I'm not able to see the linked issue, it's giving me a 404. Is there text beyond the quote that I'm not seeing?

I think for this specific example, each URL would be a separate entity, and the majority of the child pages should be crawlable via the crawler because they're just HTML/JSON/XML. However, there are a couple of services that wouldn't be crawlable, such as this application which generates KML files from map layers, which we'd need to address specifically.

So I don't have a good answer for this, but I think it raises a bigger question. My general sense of Data Together's goal is: to provide an interface for community-held data in a way that makes it easy to access/publish data on distributed infrastructure (i.e. IPFS). The website archiving project is a use case of DT, motivated by the desire to back up websites and data on independent machines, essentially providing a community cache of websites and the associated data. For clarity, I'll refer to the two types of caches separately (URLs/websites vs. data).

The relevant topics brought up in this issue are:
My proposed solution somewhat sidesteps these issues by moving the responsibility to the community, which I think is more sustainable and also thematically appropriate:
So you can think of this as a bipartite graph between URLs and data caches, where the community gets to choose the weight of the links. I like this because it's more community-oriented; it's impossible and undesirable to anticipate all of the ways data could look, and this flexibility allows the community to prioritize and present data in the ways they think are most appropriate. I think the drawback is that we have fewer guarantees, since things aren't enforced at the code level, but instead at the community level. I'm interested in hearing people's thoughts!
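To make the bipartite-graph idea concrete, here's a minimal sketch in Python (entirely hypothetical names; nothing here is an existing DT API) of URLs, data caches, and community-weighted links between them:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class UrlNode:
    """A website/URL the community wants archived."""
    url: str

@dataclass
class DataCacheNode:
    """A dataset published to distributed storage (e.g. an IPFS hash)."""
    ipfs_hash: str
    description: str = ""

@dataclass
class Link:
    """A community-assigned edge between a URL and a data cache.
    `weight` expresses how strongly the community endorses this cache
    as a representation of that URL's underlying data."""
    url: UrlNode
    cache: DataCacheNode
    weight: float = 1.0

@dataclass
class CommunityGraph:
    """The bipartite graph: URLs on one side, data caches on the other."""
    links: List[Link] = field(default_factory=list)

    def caches_for(self, url: UrlNode) -> List[Tuple[DataCacheNode, float]]:
        """Return the caches linked to a URL, ordered by community weight."""
        matches = [(l.cache, l.weight) for l in self.links if l.url == url]
        return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

In a model like this, something like `caches_for` is where the community's weighting would surface when presenting a URL's associated data; the important part is that the weights live in community-editable links, not in code.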
Thanks @jeffreyliu for taking the time to work through this topic. I'm pickin' up what you're throwin' down. If possible, it'd be great to figure out a way to integrate our findings here with the wonderful groundwork put forth by @mhucka on the topic.

I agree with everything you've said, so I'm going to rephrase it and see if we're on the same page. As you've pointed out, when we say "uncrawlable", we effectively mean: some resource we don't have direct access to, presented on the web.

Analogy time! I like to think of every stated uncrawlable resource as "a call for essays on a topic", assuming the "topic" is a single URL that doesn't crawl nicely (that will probably broaden in the future). A volunteer then submits an "essay" on the topic, which is the result of running their script. And like any good essay, it's important to have clear citation. There's one minimum citation, the script that produced the result, plus zero or more "cited" URLs that the essay claims are connected to the output. Further, essays must be structured in the same way; in our case this means all scripts must produce structured data (ideally, a single results table). This last caveat means that submissions can be compared. The only major difference in this analogy is that there's no teacher-student relationship; the community is both.

Hopefully this draws us back to the same conclusions you've stated, @jeffreyliu: we should focus on scaffolding the process of essay submission, leaving topic dictation and paper-grading to the community. Anyway, that's a lot of writing to say I agree completely with @jeffreyliu's take; hopefully this diatribe will help us when it comes to implementation details.
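To pin down what an "essay" might carry, here is a tiny sketch (hypothetical field names, not an existing schema): the script that produced the result, the URLs it cites, and the single structured output that makes submissions comparable.

```python
# A hypothetical submission record; field names are illustrative only.
submission = {
    "topic_url": "https://example.gov/uncrawlable-service",  # the "call for essays"
    "script": "harvest.py",          # the minimum citation: the code that produced the result
    "cited_urls": [                  # zero or more URLs the output is connected to
        "https://example.gov/uncrawlable-service/layers",
    ],
    "results_table": "results.csv",  # the single structured output, so submissions can be compared
}
```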
@b5 great metaphor! Yeah, I'm thinking of it as a conversation: the community identifies "hey, we have this problem" and different people can submit different solutions, which may address the problem in different ways. One example I'm thinking of is an interactive map with layers: some people in the community might want the raw data so they can do computational analysis on it, while others may just want the layers as images. So there can be multiple solutions to the same problem that address different needs. All we do is enforce that the solution shows its work (source code) and that it adheres to some very broad guidelines about data integrity, and it's up to the community to determine the connections between that solution and other resources on DT.
Yes, exactly! We can still provide examples of "good" submissions as guidelines, but they aren't hard and fast, because what ultimately matters is whether it's useful for the community.
On standup tonight:
Gonna move issues to Roadmap & DataTogether repos to tease out these parts
@jeffreyliu's comment upstream touches on a lot of topics that probably should be discussed individually. Do we have a better way of doing that than writing more comments on this issue? The linear stream of comments in GitHub issues makes for difficult reading when several topics need to be covered. I'm not sure making separate issues would be ideal either, because then things get disconnected, but I guess we could combine that with a page somewhere that gathers the topics together and links to the individual issues (maybe a wiki page? or something else?).
Just looking at edgi-govdata-archiving/archivers-harvesting-tools#8 as I wind down/clean up EDGI repos, and I think this would be wonderful. How would this type of ask work with the new and improved archivertools?