Crawler is running into terminal connection refused socket failures #4
Testing 5802bd3 to address this.
Crawls in progress. Will check on them tomorrow morning to see whether they failed part-way through or not.
@redshiftzero found that cubie3atuvex2gdw.onion, which redirects to https://another6nnp2ehkn.onion/ (self-signed cert), reproduces the error. I'm in the process of refactoring the crawler, and I have a couple more URLs for the "known to have crashed the crawler" list that I should add here soon. These might help in testing/debugging this problem. There has also been a good amount of discussion in FPF's Slack about this bug and plans to figure it out, which I should copy over here.
Copying my comments from external discussions about this. Here's the breakdown of what happens: after establishing a connection to a peer on a socket that is bound to a local address, we send a well-formed GET request to that peer (an onion service). If the remote end closes the connection without sending a response (i.e., the first line we try to read is empty), the error is triggered. What happens to the rest of the sites is as follows: a well-formed GET request is drafted and socket.connect() is called to try to connect to the remote onion service. However, the connection is refused immediately (Errno 111), and every remaining site fails the same way.
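For illustration, here's a minimal sketch of that request pattern (the direct TCP connection is a simplification, since the real crawler reaches onion services through Tor's SOCKS proxy, and the function name is hypothetical):

```python
import socket

def fetch_first_line(host, port=80):
    """Sketch of the failing pattern: bind locally, connect, send a
    well-formed GET, and read the first line of the response."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("0.0.0.0", 0))  # socket bound to a local address
    try:
        # connect() raises ConnectionRefusedError (Errno 111) when the
        # peer refuses the connection.
        s.connect((host, port))
        s.sendall("GET / HTTP/1.1\r\nHost: {}\r\n\r\n".format(host).encode())
        first_line = s.makefile("rb").readline()
        if not first_line:
            # The remote end closed the connection without responding.
            raise ConnectionError("peer closed connection with no response")
        return first_line
    finally:
        s.close()
```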
URLs known to be causing the problem: http://money2mxtcfcauot.onion and http://22222222aziwzse2.onion. (There are more, but I was negligent in saving them.)
One idea is basically to restart Tor and Tor Browser when this happens. It's a hack, but it isn't my fault that one can't simply catch this error and continue, and finding/resolving it upstream has proved to be quite difficult. I'm in the process of implementing that for the refactored crawler.
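Roughly, the plan looks like this (a sketch only: `restart_tor` assumes Tor is managed as a system service, and the crawler/Tor Browser interfaces named here are hypothetical placeholders for whatever the refactor ends up providing):

```python
import subprocess

def restart_tor():
    # Assumes Tor runs as a system service on the crawling VM
    # (hypothetical; adjust to however Tor is actually run).
    subprocess.check_call(["sudo", "service", "tor", "restart"])

def crawl_all(urls, crawler):
    for url in urls:
        try:
            crawler.crawl(url)
        except ConnectionRefusedError:
            # Once this error hits, every later connect fails too, so
            # restart the whole Tor stack instead of catching and continuing.
            restart_tor()
            crawler.restart_tor_browser()  # hypothetical helper
            crawler.crawl(url)             # retry once after the restart
```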
After beginning to install pip packages to root to "simplify" things, I started noticing some very weird errors when working from within the Ubuntu VMs. It turns out that the command-not-found package Ubuntu installs and runs as a daemon depends on Python 3.4, and it would print scary traceback warnings whenever you typed a wrong command. The tracebacks were pretty terse, so at first glance it seemed like the pip installation had gone totally haywire. Running `update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.4 2` (i.e., setting Python 3.4 as the second option when `/usr/bin/python3` is called) resolved the problem described above. However, a new scary warning took its place, caused by the inability to read the `sources.list.d/` file generated by the Ansible `apt_repository` module when adding the deadsnakes PPA. I set the same permissions on this file as `/etc/apt/sources.list` has (`chmod 0644`) and then things seemed to work smoothly.

While doing this Ansible Python install work, I also experimented with getting Python 3.5.2 running in a VM. Python 3.5.2 includes bug fixes for Python 3.5.1, some of which are related to exceptions that have been crashing our crawler (see https://bugs.python.org/issue26402 and #4). These fixes may help us avoid the hacky workarounds I've been trying to implement for the crawler refactor. Unfortunately, installing 3.5.2 proved more difficult than I imagined: there seems to be no easy way to do apt-pinning with Ansible, and none of the Debian-derivative boxes from trustworthy, well-known sources running distro versions that shipped 3.5.2 (namely, Debian stretch and Ubuntu yakkety) had the vboxfs kernel module installed, which is essential for development. Doing a dist-upgrade is not only time-prohibitive, but it seems Ansible can't even handle doing so. As far as apt-pinning goes, I tried both `command: aptitude install -y python3.5/unstable` and using the `force: yes` directive in conjunction with the `apt` Ansible module. I am committing the commented-out lines just to record some options, since it's still unknown what's causing this crash and how effective the workarounds mentioned in #4 (comment) will be.
Edit: see #4 (comment) for a better explanation and traceback. Don't know why this original report was so half-assed and lacked even the full traceback.
So the crawler is for the most part working very well. Where it runs into problems is with what seems to be a Python IO/socket exception (Errno 111). Once it hits this error, it will fail the rest of the way through the crawl pretty much instantaneously. See the log at the bottom of this post.
I believe that this is actually caused by a bug in Python 3.5--see https://bugs.python.org/issue26402--but this warrants further testing. The PPA we've been using at https://launchpad.net/~fkrull/+archive/ubuntu/deadsnakes?field.series_filter=trusty has not seen an updated version of Python 3.5 for Ubuntu 14.04 (trusty) since December. This is about our only choice for newer Python versions, and I've already done the work to migrate this script to Python 3.5 so that we could use a single virtual environment for both the HS sorting and crawling scripts. Since at this point in our research we don't really need to run the sorting script, I think I'll just break compatibility with it by making the necessary changes in the Ansible roles to install and use Python 3.3, and that should hopefully fix things.