WIP: Use Selenium to Enable Javascript / Real-Browser Scraping + Misc Fixes #302
base: master
Conversation
Oh, that's amazing! Do multiple proxies also work with geckodriver? I had tested with Chrome and couldn't get it to work.

@AllanSCosta a new driver is created for each process in the pool, and each driver is initiated with a unique proxy. This uses FirefoxDriver, but I think ChromeDriver would work for this too.
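A minimal sketch of that pairing idea (all names here are assumptions, not twitterscraper's actual code): each worker in the pool gets its own proxy, with a stub standing in for the Selenium FirefoxDriver so the logic runs without a browser.

```python
# Hedged sketch: pair each pooled worker with a unique proxy. A stub
# replaces the real selenium FirefoxDriver so this runs anywhere.
from itertools import cycle
from multiprocessing.dummy import Pool  # thread-backed Pool, same API

PROXIES = ["1.2.3.4:8080", "5.6.7.8:3128", "9.9.9.9:8000"]

def scrape_chunk(args):
    chunk, proxy = args
    # In the real branch, a FirefoxDriver configured with `proxy`
    # would fetch this chunk of the timeline here.
    return (chunk, proxy)

chunks = ["page0", "page1", "page2", "page3", "page4", "page5"]
with Pool(3) as pool:
    results = pool.map(scrape_chunk, zip(chunks, cycle(PROXIES)))
```

Each chunk lands on a proxy round-robin, so no single IP carries the whole scrape.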
Beautiful, thanks!! @lapp0, if you don't mind me asking, why was your previous usage of UserAgent dropped? I just did a quick run with it, and it seemed fine. Thanks!

@AllanSCosta users were having trouble due to Twitter dropping its legacy endpoints; see the linked issues.

I get an error like this: … Which file do I need to edit?

I got an error like this: …

Problem solved. I forgot to install Firefox... 😂

You need to install Geckodriver. If it's a Mac,
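For reference, a hedged sketch of per-platform install commands (the package names are assumptions; check your package manager's catalog):

```shell
# Print a suggested geckodriver install command for this machine.
# Package names below are assumptions, not verified against every distro.
install_geckodriver() {
  case "$(uname -s)" in
    Darwin) echo "brew install geckodriver" ;;
    Linux)  echo "sudo apt-get install firefox-geckodriver" ;;
    *)      echo "download a release from https://github.com/mozilla/geckodriver/releases" ;;
  esac
}
install_geckodriver
```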
Oh oops, you're right! I just pushed those changes in misc fixes; reverted!

Fun side note: if you want to see the browsers in action (or, if there's an issue, see what's going wrong), allow the browser to be visible by setting … Make sure you limit the size of your pool to 1, though!
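A hedged sketch of that debug switch: the real code would hand its arguments to Selenium's `webdriver.Firefox`, but here we only assemble the argument list (function name assumed) so the logic is runnable without a browser.

```python
# Illustrative only: build the Firefox CLI argument list. Dropping the
# "-headless" flag is what makes the browser window visible for debugging.
def build_firefox_args(headless=True):
    args = []
    if headless:
        args.append("-headless")  # remove to watch the browser work
    return args

# For debugging: visible browser, and keep the pool size at 1.
debug_args = build_firefox_args(headless=False)
```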
Hi @lapp0, I'm still debugging some stuff here. For some reason, the response is proper (200) and I do manage to get data, but in … [edit] Specifically, it seems that …

@AllanSCosta I could not reproduce. I'm able to get 1300 of Trump's tweets. Could you try again with the latest changes, and set …

As an aside, it appears that scrolling down on Twitter stops after 1300 tweets on … Edit: it appears the non-js query.py only gets 621 tweets, so this may just be a fundamental limitation in Twitter.

https://github.com/taspinar/twitterscraper/pull/304/files appears to fix the main issue. I am going to make js optional here so we can have a backup if/when #304's solution fails.

I ran the code …
@AllanSCosta @PUMPKINw can you please …

The screenshot correctly depicts Trump's Twitter (as if I had manually opened the browser and accessed it). Here are the versions: geckodriver 0.26.0

Thanks @AllanSCosta. Are you using selenium-wire==1.1.2? It appears I'm using a dated version (0.7.0), as I was able to reproduce this problem by upgrading to 1.1.2.

That'd be great; we might even be able to set custom uBlock rules to block irrelevant Twitter endpoints and speed up scraping.
@lapp0 I noticed this in the latest commit... And I am curious: with the …

@smuotoe Yeah, that entire function needs to be cleaned up. It's slowly built up technical debt as I've handled more and more edge cases. Good catch; pushing a fix.

@webcoderz that's too bad. Do you know if Chromium has these capabilities?

https://stackoverflow.com/questions/34222412/load-chrome-extension-using-selenium Something like that should work, using Selenium natively to do it.

@webcoderz thanks for researching!
I just pushed a commit with a lot of improvements.
Remaining work: …

You are so great!

Thanks! Your work on docker has been great too!

Hi @lapp0, great work on this thus far. I tested the latest commit and it seems there is something wrong (see above error log). Also, I noticed the …

@smuotoe ya, I set a strict timeout because some proxies are pretty slow, so we want to restart with a new proxy if it takes too long loading Twitter. I changed it to 30 seconds, though. We should probably have some logic to catch that specific error so the log is cleaner.

What's the current status of the examples, @lapp0? I haven't been able to get anything back with the example in the unit test.
Hi, any updates on this PR? Can it already be used for production, or is the rate-limits problem too massive? I just did some research on proxy lists and I think this package sounds quite promising: https://pypi.org/project/proxyscrape/

```python
import proxyscrape

def get_proxies():
    collector = proxyscrape.create_collector('default', 'http')
    return collector.get_proxies()
```

(Returns about 1000 proxies for me)
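If useful, a small adapter could turn those records into the `host:port` strings a proxy pool usually wants. The field names below are a guess at proxyscrape's record shape, with a stand-in namedtuple so the sketch runs anywhere:

```python
from collections import namedtuple

# Stand-in for the records proxyscrape returns (field names assumed).
Proxy = namedtuple("Proxy", ["host", "port"])

def to_hostports(proxies):
    # Flatten proxy records into "host:port" strings for a proxy pool.
    return [f"{p.host}:{p.port}" for p in proxies]

hostports = to_hostports([Proxy("1.2.3.4", "8080"), Proxy("5.6.7.8", "3128")])
```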
Hi, I am new to this. I tried to follow the comments in this branch, but I still got 0 tweets.
console.txt What can I do about this?

I'm having the same problem as @edmangog:
If I enable logging, I see lots of INFOs about "Got 0 tweets" but no warnings or errors. Anyone have an idea why? Did they just ban Selenium bots, too? :-(
Sorry for the delayed response, I've been quite busy with professional work lately.

Errors
@edmangog Your error is likely due to Twitter throttling and/or proxy slowness. It tried for 30 seconds to get tweets and failed, resulting in a cascade of additional errors.
@LinqLover Your error is because you're using the old interface. Thanks for the link, though; the proxy lists linked in that project's documentation will be worth experimenting with.
@webcoderz Yes, unfortunately Twitter's rate limiting appears to be breaking the test.

Core Rate Limiting Problem
Some recent experimentation has indicated two things: …
Since we're all using the same proxy list, our bandwidth is collectively limited. I can retrieve significantly more tweets without being throttled on my local IP than with the shared proxy list; however, I am throttled locally regardless. Further experimentation may find ways to stretch the usability of these proxies. I'm not even sure what the exact rate limits are per IP, and knowing that will be valuable. Regardless, a single proxy will always hit its limits, as will a collection of proxies used by a collection of users.

Solution and Implementation
I think the only solution here is to use "personal" proxy servers. This would practically make paid cloud services a requirement for twitterscraper, which may be a necessary evil. As I mentioned, I am quite busy with my professional work, but I will dedicate some time this late November to stabilizing this branch and making it compatible with a "custom proxy list". Additionally, I will need to write instructions for ad-hoc proxy generation. Thanks for your patience.
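One way a "custom proxy list" could avoid hammering any single IP is a rotation with a per-proxy cooldown. This is illustrative only; every name here is an assumption, not twitterscraper's API:

```python
# Illustrative sketch: rotate a user-supplied proxy list while enforcing
# a minimum cooldown per proxy, so no single IP trips the rate limits.
import time
from collections import deque

class ProxyRotator:
    def __init__(self, proxies, cooldown=2.0):
        self.cooldown = cooldown
        # queue of (proxy, last_used_timestamp); 0.0 means "never used"
        self.queue = deque((p, 0.0) for p in proxies)

    def next_proxy(self):
        proxy, last_used = self.queue.popleft()
        wait = self.cooldown - (time.monotonic() - last_used)
        if wait > 0:
            time.sleep(wait)  # every proxy is still cooling down
        self.queue.append((proxy, time.monotonic()))
        return proxy

rotator = ProxyRotator(["1.1.1.1:80", "2.2.2.2:80"], cooldown=0.05)
```

Because the least-recently-used proxy is always at the front of the queue, the sleep only triggers when the whole list is exhausted within one cooldown window.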
Try a Tor proxy! We fixed twint's Tor proxy and the latency isn't too bad: twintproject/twint#913. Here's everything for reference if you choose to go that way.

Thanks for the reference @webcoderz. Could you clarify the difference between these two projects? Is there some feature in twitterscraper not present in twint?

It's set up a little differently, but it seems Twitter can somewhat detect twint, as the last couple of UI changes completely broke it, whereas complete browser scraping can't really be detected, because if done correctly it's indiscernible from actual traffic (at least that's what I think, anyway).

Hello, sorry I haven't updated this in a while. I've tried to make this work, but unfortunately the only workable solution I've found is with a large number of unused proxies. If someone knows of a way to generate a large number of proxies cheaply, perhaps some kind of proxy as a service, please let me know.
How about twitterscraper grabs data only, and users buy proxies themselves? Is that more reliable than free proxies?
Right, twitterscraper should absolutely be agnostic to the source of the proxies. It's just that I'm not aware of a service I could use to test it out.
Try it with Tor. Not sure about the latency aspect, but Tor is surefire to work.
Unfortunately, with real-browser scraping Tor is extremely slow (30 seconds to load a new chunk of tweets vs. 2 seconds for a proxy). This may be alleviated by blocking certain Twitter requests, though. Either way (proxy or Tor), a Firefox profile which blocks unnecessary Twitter requests would be helpful.
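A sketch of that blocking idea: a predicate that a selenium-wire interceptor (or a uBlock rule) could apply to drop requests to hosts not needed for scraping. The blocked host list below is illustrative, not a vetted set:

```python
# Decide which request URLs to drop before the browser fetches them.
# BLOCKED_HOSTS is an illustrative guess, not a verified list.
from urllib.parse import urlparse

BLOCKED_HOSTS = {"ads-api.twitter.com", "analytics.twitter.com"}

def should_block(url):
    host = urlparse(url).hostname or ""
    return host in BLOCKED_HOSTS
```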
A variety of issues have recently arisen due to Twitter disabling their "Legacy" API, breaking twitterscraper: …

To fix this, I re-implemented `query.py` using Selenium, allowing twitterscraper to programmatically control a background (headless) Firefox instance. Additionally, I refactored `query.py` (now `query_js.py`) to be a bit cleaner.

Based on my testing, this branch can successfully download tweets from user pages and via query strings.
How to run
Please test this change so I can fix any bugs!

```shell
python3 setup.py install
```

If you hit any bugs, please paste your command and full output in this thread!
Improvements

- Added `get_query_data` (all tweets / metadata from a specific query) and `get_user_data` (all tweets / metadata on a user's page).
- Previously, `--user` wouldn't get all of a user's tweets and retweets due to a limitation in Twitter's scrollback for a given user. Now a workaround enables retrieving tweets and retweets for a specific user via a custom search: `f'filter:nativeretweets from:{from_user}'`
- `query_user_info` broken

Notes

- `pos` was removed; the browser is now used to store `pos` state implicitly.
- `--javascript` and `-j` now decide whether to use `query.py` or `query_js.py`.

Problems

- ~~`limit` no longer works, though this should be relatively easy to fix if sufficiently desired~~ (limit has now been implemented)
- Since `query_user_info` and `query_user_page` haven't been converted to use Selenium, they don't work right now. However, this data is returned as part of the metadata mentioned in Improvements bullet 2.
- … `pip install`. However, use of docker can alleviate this.
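The per-user workaround from the Improvements list amounts to building a search string like this (the username is illustrative):

```python
def user_timeline_query(from_user):
    # Matches both a user's tweets and their native retweets, working
    # around Twitter's scrollback limit on plain user pages.
    return f"filter:nativeretweets from:{from_user}"

query = user_timeline_query("taspinar")
```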