WIP: Use Selenium to Enable Javascript / Real-Browser Scraping + Misc Fixes #302
base: master
Conversation
Oh, that's amazing! Do multiple proxies also work with geckodriver? I had tested with Chrome and couldn't get it to work.

@AllanSCosta a new driver is created for each process in the pool, and each driver is initiated with a unique proxy. This uses FirefoxDriver, but I think ChromeDriver would work for this too.
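A minimal sketch of that pairing idea (all names here are assumptions, not twitterscraper's actual code): each worker in the pool gets its own proxy, with a stub standing in for the Selenium FirefoxDriver so the logic runs without a browser.

```python
# Hedged sketch: pair each pooled worker with a unique proxy. A stub
# replaces the real selenium FirefoxDriver so this runs anywhere.
from itertools import cycle
from multiprocessing.dummy import Pool  # thread-backed Pool, same API

PROXIES = ["1.2.3.4:8080", "5.6.7.8:3128", "9.9.9.9:8000"]

def scrape_chunk(args):
    chunk, proxy = args
    # In the real branch, a FirefoxDriver configured with `proxy`
    # would fetch this chunk of the timeline here.
    return (chunk, proxy)

chunks = ["page0", "page1", "page2", "page3", "page4", "page5"]
with Pool(3) as pool:
    results = pool.map(scrape_chunk, zip(chunks, cycle(PROXIES)))
```

Each chunk lands on a proxy round-robin, so no single IP carries the whole scrape.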
Beautiful, thanks!! @lapp0, if you don't mind me asking, why was your previous usage of UserAgent dropped? I just did a quick run with it, and it seemed fine. Thanks!

@AllanSCosta users were having trouble due to Twitter dropping its legacy endpoints; see the linked issues.

I get an error like this: … Which file do I need to edit?

I got an error like this: …

Problem solved. I forgot to install Firefox... 😂

You need to install Geckodriver. If it's a Mac,
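For reference, a hedged sketch of per-platform install commands (the package names are assumptions; check your package manager's catalog):

```shell
# Print a suggested geckodriver install command for this machine.
# Package names below are assumptions, not verified against every distro.
install_geckodriver() {
  case "$(uname -s)" in
    Darwin) echo "brew install geckodriver" ;;
    Linux)  echo "sudo apt-get install firefox-geckodriver" ;;
    *)      echo "download a release from https://github.com/mozilla/geckodriver/releases" ;;
  esac
}
install_geckodriver
```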
Oh oops, you're right! I just pushed those changes in misc fixes; reverted!

Fun side note: if you want to see the browsers in action (or, if there's an issue, see what's going wrong), allow the browser to be visible by setting … Make sure you limit the size of your pool to 1, though!
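A hedged sketch of that debug switch: the real code would hand its arguments to Selenium's `webdriver.Firefox`, but here we only assemble the argument list (function name assumed) so the logic is runnable without a browser.

```python
# Illustrative only: build the Firefox CLI argument list. Dropping the
# "-headless" flag is what makes the browser window visible for debugging.
def build_firefox_args(headless=True):
    args = []
    if headless:
        args.append("-headless")  # remove to watch the browser work
    return args

# For debugging: visible browser, and keep the pool size at 1.
debug_args = build_firefox_args(headless=False)
```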
Hi @lapp0, I'm still debugging some stuff here. For some reason, the response is proper (200) and I do manage to get data, but in … [edit] Specifically, it seems that …

@AllanSCosta I could not reproduce. I'm able to get 1300 of Trump's tweets. Could you try again with the latest changes, and set …

As an aside, it appears that scrolling down on Twitter stops after 1300 tweets on … Edit: it appears the non-js query.py only gets 621 tweets, so this may just be a fundamental limitation in Twitter.

https://github.com/taspinar/twitterscraper/pull/304/files appears to fix the main issue. I am going to make js optional here so we can have a backup if/when #304's solution fails.

I ran the code …
@AllanSCosta @PUMPKINw can you please …

The screenshot correctly depicts Trump's Twitter (as if I had manually opened the browser and accessed it). Here are the versions: geckodriver 0.26.0

Thanks @AllanSCosta. Are you using selenium-wire==1.1.2? It appears I'm using a dated version (0.7.0), as I was able to reproduce this problem by upgrading to 1.1.2.

That'd be great; we might even be able to set custom uBlock rules to block irrelevant Twitter endpoints and speed up scraping.
@lapp0 I noticed this in the latest commit... And I am curious: with the …

@smuotoe Yeah, that entire function needs to be cleaned up. It's slowly built up technical debt as I've handled more and more edge cases. Good catch; pushing a fix.

@webcoderz that's too bad. Do you know if Chromium has these capabilities?

https://stackoverflow.com/questions/34222412/load-chrome-extension-using-selenium Something like that should work, using Selenium natively to do it.

@webcoderz thanks for researching!
I just pushed a commit with a lot of improvements.
Remaining work: …

You are so great!

Thanks! Your work on docker has been great too!

Hi @lapp0, great work on this thus far. I tested the latest commit and it seems there is something wrong (see above error log). Also, I noticed the …

@smuotoe ya, I set a strict timeout because some proxies are pretty slow, so we want to restart with a new proxy if it takes too long loading Twitter. I changed it to 30 seconds, though. We should probably have some logic to catch that specific error so the log is cleaner.

What's the current status of the examples, @lapp0? I haven't been able to get anything back with the example in the unit test.
Hi, any updates on this PR? Can it already be used for production, or is the rate-limits problem too massive? I just did some research on proxy lists and I think this package sounds quite promising: https://pypi.org/project/proxyscrape/

```python
import proxyscrape

def get_proxies():
    collector = proxyscrape.create_collector('default', 'http')
    return collector.get_proxies()
```

(Returns about 1000 proxies for me)
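If useful, a small adapter could turn those records into the `host:port` strings a proxy pool usually wants. The field names below are a guess at proxyscrape's record shape, with a stand-in namedtuple so the sketch runs anywhere:

```python
from collections import namedtuple

# Stand-in for the records proxyscrape returns (field names assumed).
Proxy = namedtuple("Proxy", ["host", "port"])

def to_hostports(proxies):
    # Flatten proxy records into "host:port" strings for a proxy pool.
    return [f"{p.host}:{p.port}" for p in proxies]

hostports = to_hostports([Proxy("1.2.3.4", "8080"), Proxy("5.6.7.8", "3128")])
```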
Hi, I am new to this. I tried to follow the comments in this branch, but I still got 0 tweets.
console.txt What can I do about this?

I'm having the same problem as @edmangog:
If I enable logging, I see lots of INFOs about "Got 0 tweets" but no warnings or errors. Anyone have an idea why? Did they just ban Selenium bots, too? :-(
Sorry for the delayed response, I've been quite busy with professional work lately.

Errors
@edmangog Your error is likely due to Twitter throttling and/or proxy slowness. It tried for 30 seconds to get tweets and failed, resulting in a cascade of additional errors.
@LinqLover Your error is because you're using the old interface. Thanks for the link, though; the proxy lists linked in that project's documentation will be worth experimenting with.
@webcoderz Yes, unfortunately Twitter's rate limiting appears to be breaking the test.

Core Rate Limiting Problem
Some recent experimentation has indicated two things: …
Since we're all using the same proxy list, our bandwidth is collectively limited. I can retrieve significantly more tweets without being throttled on my local IP than with the shared proxy list; however, I am throttled locally regardless. Further experimentation may find ways to stretch the usability of these proxies. I'm not even sure what the exact rate limits are per IP, and knowing that will be valuable. Regardless, a single proxy will always hit its limits, as will a collection of proxies used by a collection of users.

Solution and Implementation
I think the only solution here is to use "personal" proxy servers. This would practically make paid cloud services a requirement for twitterscraper, which may be a necessary evil. As I mentioned, I am quite busy with my professional work, but I will dedicate some time this late November to stabilizing this branch and making it compatible with a "custom proxy list". Additionally, I will need to write instructions for ad-hoc proxy generation. Thanks for your patience.
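One way a "custom proxy list" could avoid hammering any single IP is a rotation with a per-proxy cooldown. This is illustrative only; every name here is an assumption, not twitterscraper's API:

```python
# Illustrative sketch: rotate a user-supplied proxy list while enforcing
# a minimum cooldown per proxy, so no single IP trips the rate limits.
import time
from collections import deque

class ProxyRotator:
    def __init__(self, proxies, cooldown=2.0):
        self.cooldown = cooldown
        # queue of (proxy, last_used_timestamp); 0.0 means "never used"
        self.queue = deque((p, 0.0) for p in proxies)

    def next_proxy(self):
        proxy, last_used = self.queue.popleft()
        wait = self.cooldown - (time.monotonic() - last_used)
        if wait > 0:
            time.sleep(wait)  # every proxy is still cooling down
        self.queue.append((proxy, time.monotonic()))
        return proxy

rotator = ProxyRotator(["1.1.1.1:80", "2.2.2.2:80"], cooldown=0.05)
```

Because the least-recently-used proxy is always at the front of the queue, the sleep only triggers when the whole list is exhausted within one cooldown window.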
Try a Tor proxy! We fixed twint's Tor proxy and the latency isn't too bad: twintproject/twint#913. Here's everything for reference if you choose to go that way.

Thanks for the reference @webcoderz. Could you clarify the difference between these two projects? Is there some feature in twitterscraper not present in twint?

It's set up a little differently, but it seems Twitter can somewhat detect twint, as the last couple of UI changes completely broke it, whereas complete browser scraping can't really be detected, because if done correctly it's indiscernible from actual traffic (at least that's what I think, anyway).

Hello, sorry I haven't updated this in a while. I've tried to make this work, but unfortunately the only workable solution I've found is with a large number of unused proxies. If someone knows of a way to generate a large number of proxies cheaply, perhaps some kind of proxy as a service, please let me know.
How about twitterscraper grabs data only, and users buy proxies themselves? Is that more reliable than free proxies?
Right, twitterscraper should absolutely be agnostic to the source of the proxies. It's just that I'm not aware of a service I could use to test it out.
Try it with Tor. Not sure about the latency aspect, but Tor is surefire to work.
Unfortunately, with real-browser scraping Tor is extremely slow (30 seconds to load a new chunk of tweets vs. 2 seconds for a proxy). This may be alleviated by blocking certain Twitter requests, though. Either way (proxy or Tor), a Firefox profile which blocks unnecessary Twitter requests would be helpful.
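A sketch of that blocking idea: a predicate that a selenium-wire interceptor (or a uBlock rule) could apply to drop requests to hosts not needed for scraping. The blocked host list below is illustrative, not a vetted set:

```python
# Decide which request URLs to drop before the browser fetches them.
# BLOCKED_HOSTS is an illustrative guess, not a verified list.
from urllib.parse import urlparse

BLOCKED_HOSTS = {"ads-api.twitter.com", "analytics.twitter.com"}

def should_block(url):
    host = urlparse(url).hostname or ""
    return host in BLOCKED_HOSTS
```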
A variety of issues have recently arisen due to Twitter disabling their "Legacy" API, breaking twitterscraper: …

To fix this, I re-implemented `query.py` using Selenium, allowing twitterscraper to programmatically control a background (headless) Firefox instance. Additionally, I refactored `query.py` (now `query_js.py`) to be a bit cleaner.

Based on my testing, this branch can successfully download tweets from user pages and via query strings.
How to run
Please test this change so I can fix any bugs!

```shell
python3 setup.py install
```

If you hit any bugs, please paste your command and full output in this thread!
Improvements

- Added `get_query_data` (all tweets / metadata from a specific query) and `get_user_data` (all tweets / metadata on a user's page).
- Previously, `--user` wouldn't get all of a user's tweets and retweets due to a limitation in Twitter's scrollback for a given user. Now a workaround enables retrieving tweets and retweets for a specific user via a custom search: `f'filter:nativeretweets from:{from_user}'`
- `query_user_info` broken

Notes

- `pos` was removed; the browser is now used to store `pos` state implicitly.
- `--javascript` and `-j` now decide whether to use `query.py` or `query_js.py`.

Problems

- ~~`limit` no longer works, though this should be relatively easy to fix if sufficiently desired~~ (limit has now been implemented)
- Since `query_user_info` and `query_user_page` haven't been converted to use Selenium, they don't work right now. However, this data is returned as part of the metadata mentioned in Improvements bullet 2.
- … `pip install`. However, use of docker can alleviate this.
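The per-user workaround from the Improvements list amounts to building a search string like this (the username is illustrative):

```python
def user_timeline_query(from_user):
    # Matches both a user's tweets and their native retweets, working
    # around Twitter's scrollback limit on plain user pages.
    return f"filter:nativeretweets from:{from_user}"

query = user_timeline_query("taspinar")
```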