Is verified user info (#198)
* Add fake-useragent to requirements

* update version numbering in setup and changelog

* update readme

* Additionally scrape for is_verified when scraping user profiles
taspinar authored Jun 22, 2019
1 parent 1fe473b commit 882d256
Showing 6 changed files with 51 additions and 48 deletions.
52 changes: 36 additions & 16 deletions README.rst
@@ -37,20 +37,28 @@ Per Tweet it scrapes the following information:
+ Tweet text
+ Tweet html
+ Tweet timestamp
+ Tweet Epoch timestamp
+ Tweet No. of likes
+ Tweet No. of replies
+ Tweet No. of retweets
+ Username
+ User Full Name
+ User ID
+ Tweet is a retweet
+ Username retweeter
+ User ID retweeter
+ Retweet ID

In addition, it can scrape for the following user information:
+ Date user joined
+ User location (if filled in)
+ User blog (if filled in)
+ User No. of tweets
+ User No. of following
+ User No. of followers
+ User No. of likes
+ User No. of lists
+ User is verified


2. Installation and Usage
@@ -96,7 +104,12 @@ JSON right away. Twitterscraper takes several arguments:
default value is set to today. This does not work in combination with ``--user``.

- ``-u`` or ``--user`` Scrapes the tweets from that user's profile page.
  This also includes all retweets by that user. See section 2.2.3 in the examples below
  for more information.

- ``--profiles`` In addition to the tweets, twitterscraper will also scrape for the profile
  information of the users who have written these tweets. The results will be saved in the
  file ``userprofiles_<filename>`` (see the combined example after this list).

- ``-p`` or ``--poolsize`` Set the number of parallel processes
TwitterScraper should initiate while scraping for your query. Default
@@ -121,21 +134,18 @@ JSON right away. Twitterscraper takes several arguments:
- ``-ow`` or ``--overwrite``: With this argument, if the output file already exists
  it will be overwritten. If this argument is not set (default), twitterscraper will
  exit with a warning that the output file already exists.

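
These arguments can be combined in a single call. As a hypothetical example, the
following run scrapes a bounded date range and also collects the authors' profiles:

``twitterscraper Trump -l 100 -bd 2017-01-01 -ed 2017-06-01 --profiles -o tweets.json``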


2.2.1 Examples of simple queries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Below are some examples of how twitterscraper can be used:

``twitterscraper Trump --limit 1000 --output=tweets.json``

``twitterscraper Trump -l 1000 -o tweets.json``

``twitterscraper Trump -l 1000 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json``



@@ -149,9 +159,9 @@ as one single query.
Here are some examples:

- search for the occurrence of 'Bitcoin' or 'BTC':
  ``twitterscraper "Bitcoin OR BTC" -o bitcoin_tweets.json -l 1000``
- search for the occurrence of 'Bitcoin' and 'BTC':
  ``twitterscraper "Bitcoin AND BTC" -o bitcoin_tweets.json -l 1000``
- search for tweets from a specific user:
``twitterscraper "Blockchain from:VitalikButerin" -o blockchain_tweets.json -l 1000``
- search for tweets to a specific user:
@@ -167,17 +177,19 @@ Also see `Twitter's Standard operators <https://developer.twitter.com/en/docs/tw
2.2.3 Examples of scraping user pages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also scrape all tweets written or retweeted by a specific user.
This can be done by adding the boolean ``-u / --user`` argument.
If this argument is used, the search term should be equal to the username.

Here is an example of scraping a specific user:

``twitterscraper realDonaldTrump --user -o tweets_username.json``

This does not work in combination with ``-p``, ``-bd``, or ``-ed``.

The main difference from the "search for tweets from a specific user" example in section 2.2.2
is that this method really scrapes all tweets from a profile page (including retweets).
The example in 2.2.2 scrapes the results from the search page (excluding retweets).


2.3 From within Python
----------------------

@@ -206,15 +218,23 @@ You can easily use TwitterScraper from within python:
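
The snippet itself is collapsed in this view; what follows is a minimal sketch of
typical usage, assuming the ``query_tweets`` helper that the package exports (the
attribute names are taken from the Tweet fields listed in section 1)::

    from twitterscraper import query_tweets

    # Scrape up to 100 tweets matching the query and print two of the
    # scraped fields for each tweet.
    for tweet in query_tweets("Trump", 100):
        print(tweet.timestamp, tweet.text)
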
2.4 Scraping for retweets
-------------------------

A regular search within Twitter will not show you any retweets.
Twitterscraper's output therefore does not contain any retweets either.

To give an example: if user1 has written a tweet containing ``#trump2020`` and user2 has retweeted this tweet,
a search for ``#trump2020`` will only show the original tweet.

The only way to scrape for retweets is to scrape for all tweets of a specific user with the ``-u / --user`` argument.
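
For example, with the hypothetical usernames above, ``twitterscraper user2 --user -o tweets_user2.json`` would also return user2's retweet of user1's tweet.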


2.5 Scraping for User Profile information
-----------------------------------------
By adding the ``--profiles`` argument, twitterscraper will, in addition to the tweets, also scrape for the profile information of the users who have written these tweets.
The results will be saved in the file ``userprofiles_<filename>``.

Try not to use this argument too much. If you have already scraped profile information for a set of users, there is no need to do it again :)
It is also possible to scrape for profile information without scraping for tweets.
Examples of this can be found in the examples folder.
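
The same can be sketched from within Python. The snippet below is hypothetical:
the helper name ``query_user_info`` and its signature are assumptions, so check
the examples folder for the canonical version::

    # Hypothetical sketch; helper name and signature are assumed.
    from twitterscraper.query import query_user_info

    # Fetch a single user's profile; the returned User object carries the
    # profile fields listed in section 1, including the new is_verified flag.
    profile = query_user_info("realDonaldTrump")
    print(profile.user, profile.followers, profile.is_verified)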


3. Output
6 changes: 6 additions & 0 deletions changelog.txt
@@ -1,5 +1,11 @@
# twitterscraper changelog

# 1.2.0 ( 2019-06-22 )
### Added
- PR #186: adds the fields is_retweet, retweeter-related information, and timestamp_epochs to the output.
- PR #184: use fake_useragent for generating random user-agent headers.
- Additionally scrape for 'is_verified' when scraping user profile pages.

# 1.1.0 ( 2019-06-15 )
### Added
- PR #176: Using billiard library instead of multiprocessing to add the ability to use this library with Celery.
28 changes: 1 addition & 27 deletions requirements.txt
@@ -2,31 +2,5 @@ coala-utils~=0.5.0
bs4
lxml
requests
billiard
fake-useragent
4 changes: 1 addition & 3 deletions setup.py
@@ -1,14 +1,12 @@
#!/usr/bin/env python3

from setuptools import setup, find_packages


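# Load the dependency list from requirements.txt; it feeds install_requires
# further down (collapsed in this diff view).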
with open('requirements.txt') as requirements:
    required = requirements.read().splitlines()

setup(
    name='twitterscraper',
    version='1.2.0',
    description='Tool for scraping Tweets',
    url='https://github.com/taspinar/twitterscraper',
    author=['Ahmet Taspinar', 'Lasse Schuirmann'],
2 changes: 1 addition & 1 deletion twitterscraper/__init__.py
@@ -5,7 +5,7 @@
Twitter Scraper tool
"""

__version__ = '1.2.0'
__author__ = 'Ahmet Taspinar'
__license__ = 'MIT'

7 changes: 6 additions & 1 deletion twitterscraper/user.py
@@ -3,7 +3,7 @@

class User:
    def __init__(self, user="", full_name="", location="", blog="", date_joined="", id="", tweets=0,
                 following=0, followers=0, likes=0, lists=0, is_verified=0):
        self.user = user
        self.full_name = full_name
        self.location = location
@@ -15,6 +15,7 @@ def __init__(self, user="", full_name="", location="", blog="", date_joined="",
        self.followers = followers
        self.likes = likes
        self.lists = lists
        self.is_verified = is_verified

    @classmethod
    def from_soup(self, tag_prof_header, tag_prof_nav):
@@ -47,6 +48,10 @@ def from_soup(self, tag_prof_header, tag_prof_nav):
        else:
            self.date_joined = date_joined.strip()

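        # Verified accounts carry a badge rendered inside the
        # 'ProfileHeaderCard-badges' span of the profile header;
        # its presence alone marks the user as verified.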
        tag_verified = tag_prof_header.find('span', {'class': "ProfileHeaderCard-badges"})
        if tag_verified is not None:
            self.is_verified = 1

        self.id = tag_prof_nav.find('div', {'class': 'ProfileNav'})['data-user-id']
        tweets = tag_prof_nav.find('span', {'class': "ProfileNav-value"})['data-count']
        if tweets is None:
