
Crawl in parallel #9

Closed
psivesely opened this issue Jun 21, 2016 · 4 comments

@psivesely
Contributor

The crawler currently fetches each site one at a time. That was easiest to implement and it ensures clean traces. With the circuit introspection that stem provides, we should be able to identify (circuit, site-instance) tuples. Feeding this info to a modified record_cell_seq() method could allow us to separate the cells from the different circuits and still get clean traces. Of course, we'd need to save some file pointer state, creating a (circuit, site-instance, start_file_ptr) tuple instead, since we need to keep track of the time span in our tor_cell_log during which we should be looking for each instance's cells.
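For concreteness, here's a minimal sketch (the handler and mapping are my own, not crawler code) of the kind of introspection stem exposes: each STREAM event carries the ID of the circuit the stream was attached to, which is the raw material for building those (circuit, site-instance) tuples.

```python
from stem.control import Controller, EventType

stream_to_circuit = {}

def on_stream(event):
    # Each STREAM event names the circuit the stream was attached to;
    # paired with the site instance we just launched, that gives the
    # (circuit, site-instance) tuple described above.
    if event.circ_id:
        stream_to_circuit[event.id] = event.circ_id

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    controller.add_event_listener(on_stream, EventType.STREAM)
    # ... launch a site instance here and note the current tor_cell_log
    # file offset to form (circuit, site-instance, start_file_ptr) ...
```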

It's unclear how much work this would require, or whether it could muddy our results by creating an unrealistic amount of Tor traffic that slows the loading of each instance.

@psivesely
Contributor Author

So, for one, Selenium doesn't support tabs (see SeleniumHQ/selenium-google-code-issue-archive#5572 (comment)). People have managed to hack around that by using selenium.webdriver.common.Keys to send inputs to the browser like ctrl + t and alt + <tab_number> (see https://gist.github.com/lrhache/7686903 for one example). I'm thinking that this technique, combined with a Queue and Pool from the multiprocess library, could be used to efficiently crawl multiple sites at once.
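Roughly, the gist's technique looks like this (a sketch, untested here; whether the shortcuts register varies across Selenium and Firefox versions):

```python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get('https://example.com')
body = driver.find_element_by_tag_name('body')

body.send_keys(Keys.CONTROL + 't')   # ctrl + t: open a new tab
driver.get('https://example.org')    # loads in the newly focused tab
body.send_keys(Keys.ALT + '1')       # alt + 1: switch back to the first tab
```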

@psivesely
Contributor Author

Sorry, I misspoke: the selenium.webdriver.common.Keys hack would mean running one browser, in which case we can't operate in parallel the way multiprocess does (we can't issue commands to two different tabs at the same time). We would either need to use selenium.webdriver.common.Keys in combination with asyncio, or instead use multiprocess to run multiple instances of the Tor Browser. The problem with the latter solution is that our VPS VMs have only 500 MB of RAM. A few tabs in one TB instance under Xvfb is probably the best we're going to get, so asyncio will probably be the way to go for this.
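The shape of the asyncio approach would be something like the following sketch (open_tab_and_load() is a hypothetical helper wrapping the Keys hack above; this shows the structure, not a full multi-tab scheduler):

```python
import asyncio

def open_tab_and_load(driver, url):
    # Hypothetical helper wrapping the Keys hack from the previous comment:
    # open a new tab and kick off a load of url in it.
    ...

async def crawl(sites, driver):
    # One browser means WebDriver commands go out one at a time; running the
    # blocking calls in the executor keeps the event loop free for other
    # coroutines (stem event handling, cell-log bookkeeping, timeouts).
    loop = asyncio.get_event_loop()
    for url in sites:
        await loop.run_in_executor(None, open_tab_and_load, driver, url)
```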

Another thing to keep in mind is the restart() method, which I will be using to work around the thus-far inexplicable error in #4. You'll see restart() in the refactored crawler, which should be pushed here today.

@psivesely
Contributor Author

Sorry, that was unfinished. What I meant to say about the restart_tor_and_tb() method is that it restarts tor and the Tor Browser. That means traces that are in progress will need to be restarted once tor and the Tor Browser are back up and running. So we'll need some way of keeping track of which sites our workers are collecting traces from, so that we can push those sites back onto the top of the Queue during the restart process.
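Something like this bookkeeping would do it (a sketch with assumed names, not code from the repo; queue is whatever Queue the crawler already uses):

```python
# worker_id -> site whose trace is currently underway
in_progress = {}

def note_start(worker_id, site):
    in_progress[worker_id] = site

def note_done(worker_id):
    in_progress.pop(worker_id, None)

def requeue_in_progress(queue):
    # Called from restart_tor_and_tb(): any trace interrupted by the restart
    # must be redone, so push its site back onto the queue.
    for site in in_progress.values():
        queue.put(site)
    in_progress.clear()
```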

@psivesely
Contributor Author

To make this work, the crawler would have to be capable of associating a given stream in a general circuit with the onion service that included the resource being loaded over that stream. Since Tor Browser (Firefox) does the parsing of those pages and the automatic loading of those 3rd-party resources, even if the Selenium API had some way of letting us query which 3rd-party resources a given onion service loads, there is no way for stem to know which resource might be loading over a given stream. I'm closing this issue because the support we'd need to make this work just isn't there, and it's too big an endeavor for too small a use case; the time-benefit tradeoff for us is definitely not there.
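To illustrate the gap (again a sketch, not crawler code): a STREAM event tells us where a connection is going and on which circuit, but no field ties it back to the tab, page, or onion service whose HTML triggered the fetch; that resolution happens inside Firefox, out of stem's view.

```python
from stem.control import Controller, EventType

def on_stream(event):
    # We can see the stream's destination and circuit, but no field names
    # the originating tab, page, or onion service.
    print(event.id, event.status, event.circ_id, event.target_address)

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    controller.add_event_listener(on_stream, EventType.STREAM)
    input('listening for STREAM events; press enter to stop\n')
```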
