
Crawl in parallel #9

Closed
psivesely opened this issue Jun 21, 2016 · 4 comments

@psivesely
Contributor

The crawler currently fetches each site one at a time. That was easiest to implement and it ensures clean traces. With the circuit introspection that stem provides, we should be able to identify (circuit, site-instance) tuples. Feeding this info to a modified record_cell_seq() method could allow us to separate the cells from the different circuits and still get clean traces. Of course, we'd need to save some file pointer state, creating a (circuit, site-instance, start_file_ptr) tuple instead, since we need to keep track of the time span in our tor_cell_log during which we should be looking for each instance's cells.
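For concreteness, here's a minimal sketch (the handler and mapping are my own, not crawler code) of the kind of introspection stem exposes: each STREAM event carries the ID of the circuit the stream was attached to, which is the raw material for building those (circuit, site-instance) tuples.

```python
from stem.control import Controller, EventType

stream_to_circuit = {}

def on_stream(event):
    # Each STREAM event names the circuit the stream was attached to;
    # paired with the site instance we just launched, that gives the
    # (circuit, site-instance) tuple described above.
    if event.circ_id:
        stream_to_circuit[event.id] = event.circ_id

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    controller.add_event_listener(on_stream, EventType.STREAM)
    # ... launch a site instance here and note the current tor_cell_log
    # file offset to form (circuit, site-instance, start_file_ptr) ...
```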

It's unclear how much work this would require, or whether it could muddy our results by creating an unrealistic amount of Tor traffic that slows the loading of each instance.

@psivesely
Contributor Author

So, for one, Selenium doesn't support tabs (see SeleniumHQ/selenium-google-code-issue-archive#5572 (comment)). People have managed to hack around that by using selenium.webdriver.common.Keys to send inputs to the browser like ctrl + t and alt + <tab_number> (see https://gist.github.com/lrhache/7686903 for one example). I'm thinking that this technique, combined with a Queue and Pool from the multiprocess library, could be used to efficiently crawl multiple sites at once.
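Roughly, the gist's technique looks like this (a sketch, untested here; whether the shortcuts register varies across Selenium and Firefox versions):

```python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get('https://example.com')
body = driver.find_element_by_tag_name('body')

body.send_keys(Keys.CONTROL + 't')   # ctrl + t: open a new tab
driver.get('https://example.org')    # loads in the newly focused tab
body.send_keys(Keys.ALT + '1')       # alt + 1: switch back to the first tab
```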

@psivesely
Contributor Author

Sorry, I misspoke: the selenium.webdriver.common.Keys hack would mean running one browser, in which case we can't operate in parallel the way multiprocess does (we can't issue commands to two different tabs at the same time). We would either need to use selenium.webdriver.common.Keys in combination with asyncio, or instead use multiprocess to run multiple instances of the Tor Browser. The problem with the latter solution is that our VPS VMs have only 500 MB of RAM. A few tabs in one TB instance under Xvfb is probably the best we're going to get, so asyncio will probably be the way to go for this.
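The shape of the asyncio approach would be something like the following sketch (open_tab_and_load() is a hypothetical helper wrapping the Keys hack above; this shows the structure, not a full multi-tab scheduler):

```python
import asyncio

def open_tab_and_load(driver, url):
    # Hypothetical helper wrapping the Keys hack from the previous comment:
    # open a new tab and kick off a load of url in it.
    ...

async def crawl(sites, driver):
    # One browser means WebDriver commands go out one at a time; running the
    # blocking calls in the executor keeps the event loop free for other
    # coroutines (stem event handling, cell-log bookkeeping, timeouts).
    loop = asyncio.get_event_loop()
    for url in sites:
        await loop.run_in_executor(None, open_tab_and_load, driver, url)
```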

Another thing to keep in mind is the restart() method, which I will be using to work around the thus-far inexplicable error in #4. You'll see restart() in the refactored crawler, which should be pushed here today.

@psivesely
Contributor Author

Sorry, that was unfinished. What I meant to say about the restart_tor_and_tb() method is that it restarts tor and the Tor Browser. That means traces that are in progress will need to be restarted once tor and the Tor Browser are back up and running. So we'll need some way of keeping track of which sites our workers are collecting traces from, so that we can push those sites back onto the top of the Queue during the restart process.
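Something like this bookkeeping would do it (a sketch with assumed names, not code from the repo; queue is whatever Queue the crawler already uses):

```python
# worker_id -> site whose trace is currently underway
in_progress = {}

def note_start(worker_id, site):
    in_progress[worker_id] = site

def note_done(worker_id):
    in_progress.pop(worker_id, None)

def requeue_in_progress(queue):
    # Called from restart_tor_and_tb(): any trace interrupted by the restart
    # must be redone, so push its site back onto the queue.
    for site in in_progress.values():
        queue.put(site)
    in_progress.clear()
```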

@psivesely
Contributor Author

To make this work, the crawler would have to be capable of associating a given stream in a general circuit with the onion service that included the resource being loaded over that stream. Since Tor Browser (Firefox) does the parsing of those pages and the automatic loading of those 3rd-party resources, even if the Selenium API had some way of letting us query which 3rd-party resources a given onion service loads, there is no way for stem to know which resource might be loading over a given stream. I'm closing this issue because the support we'd need to make this work just isn't there, and it's too big an endeavor for too small a use case; the time-benefit tradeoff for us is definitely not there.
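To illustrate the gap (again a sketch, not crawler code): a STREAM event tells us where a connection is going and on which circuit, but no field ties it back to the tab, page, or onion service whose HTML triggered the fetch; that resolution happens inside Firefox, out of stem's view.

```python
from stem.control import Controller, EventType

def on_stream(event):
    # We can see the stream's destination and circuit, but no field names
    # the originating tab, page, or onion service.
    print(event.id, event.status, event.circ_id, event.target_address)

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    controller.add_event_listener(on_stream, EventType.STREAM)
    input('listening for STREAM events; press enter to stop\n')
```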
