Crawl in parallel #9
Comments
So, for one, Selenium doesn't support tabs (see SeleniumHQ/selenium-google-code-issue-archive#5572 (comment)). People have managed to hack around that by using |
Sorry, I misspoke, the Another thing to keep in mind is the |
Sorry, that was unfinished. What I meant to say about the |
To make this work, the crawler would have to be capable of associating a given stream in a general circuit with the onion service that included the resource being loaded over that stream. Since Tor Browser (Firefox) parses those pages and automatically loads those 3rd-party resources, even if the Selenium API had some way of letting us query which 3rd-party resources are being loaded by a given onion service, there is no way for stem to know what resource might be loading from a given stream. I'm closing this issue because the support we'd need to make this work is just not there, and it's too big an endeavor for too small a use case; the time-benefit tradeoff for us is definitely not there.
The crawler currently fetches each site one at a time. That was easiest to implement and ensures clean traces. With the introspection into circuits that stem gives, we should be able to identify `(circuit, site-instance)` tuples. This info, given to a modified `record_cell_seq()` method, could allow us to separate the cells from the different circuits and still get clean traces. Of course, we'd need to save some file pointer state, creating instead a `(circuit, site-instance, start_file_ptr)` tuple, since we need to keep track of what time span we should be looking for cells from each instance in our `tor_cell_log`.

Unclear how much work this would require and if it could potentially muddy our results by just creating an unrealistic amount of Tor traffic that slows down the loading of each instance.
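As a rough illustration of the tuple idea, here is a minimal sketch of how cells might be separated per instance. It does not reflect the crawler's real code: the `InstanceWindow` type, the `split_cells()` helper, and the log layout (one cell per line as `timestamp circ_id direction`) are all assumptions made up for this example, and the actual `tor_cell_log` format may differ.

```python
from typing import BinaryIO, Dict, List, NamedTuple


class InstanceWindow(NamedTuple):
    """Hypothetical (circuit, site-instance, start_file_ptr) record."""
    circ_id: str         # circuit ID, e.g. as reported by the controller
    site_instance: str   # illustrative label such as "a.onion#0"
    start_file_ptr: int  # byte offset in the cell log when the load began


def split_cells(log: BinaryIO,
                windows: List[InstanceWindow]) -> Dict[str, List[str]]:
    """Collect the cell directions belonging to each site instance.

    Assumes each log line looks like b"<timestamp> <circ_id> <direction>".
    Lines that do not have exactly three fields are skipped.
    """
    traces: Dict[str, List[str]] = {w.site_instance: [] for w in windows}
    for w in windows:
        # Jump to where this instance started, then scan forward.
        log.seek(w.start_file_ptr)
        for raw in log:
            fields = raw.decode("ascii", errors="replace").split()
            if len(fields) == 3 and fields[1] == w.circ_id:
                traces[w.site_instance].append(fields[2])
    return traces
```

This sketch rescans the log once per instance and has no end pointer, so it would misattribute cells if a circuit ID were reused later in the same log. It also does nothing about the open question above of whether parallel loads distort timing.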