Is verified user info (#198)
* Add fake-useragent to requirements

* update version numbering in setup and changelog

* update readme

* Additionally scrape for is_verified when scraping user profiles
taspinar authored Jun 22, 2019
1 parent 1fe473b commit 882d256
Showing 6 changed files with 51 additions and 48 deletions.
52 changes: 36 additions & 16 deletions README.rst
@@ -37,20 +37,28 @@ Per Tweet it scrapes the following information:
+ Tweet text
+ Tweet html
+ Tweet timestamp
+ Tweet Epoch timestamp
+ Tweet No. of likes
+ Tweet No. of replies
+ Tweet No. of retweets
+ Username
+ User Full Name
+ User ID
+ Tweet is a retweet
+ Username retweeter
+ User ID retweeter
+ Retweet ID

In addition, it can scrape for the following user information:
+ Date user joined
+ User location (if filled in)
+ User blog (if filled in)
+ User No. of tweets
+ User No. of following
+ User No. of followers
+ User No. of likes
+ User No. of lists
+ User is verified


2. Installation and Usage
@@ -96,7 +104,12 @@ JSON right away. Twitterscraper takes several arguments:
default value is set to today. This does not work in combination with ``--user``.

- ``-u`` or ``--user`` Scrapes the tweets from that user's profile page.
  This also includes all retweets by that user. See section 2.2.3 in the examples below
  for more information.

- ``--profiles`` In addition to the tweets, twitterscraper will also scrape for the profile
  information of the users who have written these tweets. The results will be saved in the
  file ``userprofiles_<filename>`` (see the combined example after this list).

- ``-p`` or ``--poolsize`` Set the number of parallel processes
TwitterScraper should initiate while scraping for your query. Default
@@ -121,21 +134,18 @@ JSON right away. Twitterscraper takes several arguments:
- ``-ow`` or ``--overwrite``: With this argument, if the output file already exists
  it will be overwritten. If this argument is not set (default), twitterscraper will
  exit with a warning that the output file already exists.

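
These arguments can be combined in a single call. As a hypothetical example, the
following run scrapes a bounded date range and also collects the authors' profiles:

``twitterscraper Trump -l 100 -bd 2017-01-01 -ed 2017-06-01 --profiles -o tweets.json``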


2.2.1 Examples of simple queries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Below are some examples of how twitterscraper can be used:

``twitterscraper Trump --limit 1000 --output=tweets.json``

``twitterscraper Trump -l 1000 -o tweets.json``

``twitterscraper Trump -l 1000 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json``



@@ -149,9 +159,9 @@ as one single query.
Here are some examples:

- search for the occurrence of 'Bitcoin' or 'BTC':
  ``twitterscraper "Bitcoin OR BTC" -o bitcoin_tweets.json -l 1000``
- search for the occurrence of 'Bitcoin' and 'BTC':
  ``twitterscraper "Bitcoin AND BTC" -o bitcoin_tweets.json -l 1000``
- search for tweets from a specific user:
``twitterscraper "Blockchain from:VitalikButerin" -o blockchain_tweets.json -l 1000``
- search for tweets to a specific user:
@@ -167,17 +177,19 @@ Also see `Twitter's Standard operators <https://developer.twitter.com/en/docs/tw
2.2.3 Examples of scraping user pages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also scrape all tweets written or retweeted by a specific user.
This can be done by adding the boolean ``-u / --user`` argument.
If this argument is used, the search term should be equal to the username.

Here is an example of scraping a specific user:

``twitterscraper realDonaldTrump --user -o tweets_username.json``

This does not work in combination with ``-p``, ``-bd``, or ``-ed``.

The main difference from the "search for tweets from a specific user" example in section 2.2.2
is that this method really scrapes all tweets from a profile page (including retweets).
The example in 2.2.2 scrapes the results from the search page (excluding retweets).


2.3 From within Python
----------------------

@@ -206,15 +218,23 @@ You can easily use TwitterScraper from within python:
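
The snippet itself is collapsed in this view; what follows is a minimal sketch of
typical usage, assuming the ``query_tweets`` helper that the package exports (the
attribute names are taken from the Tweet fields listed in section 1)::

    from twitterscraper import query_tweets

    # Scrape up to 100 tweets matching the query and print two of the
    # scraped fields for each tweet.
    for tweet in query_tweets("Trump", 100):
        print(tweet.timestamp, tweet.text)
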
2.4 Scraping for retweets
-------------------------

A regular search within Twitter will not show you any retweets.
Twitterscraper's output therefore does not contain any retweets either.

To give an example: if user1 has written a tweet containing ``#trump2020`` and user2 has retweeted this tweet,
a search for ``#trump2020`` will only show the original tweet.

The only way to scrape for retweets is to scrape for all tweets of a specific user with the ``-u / --user`` argument.
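
For example, with the hypothetical usernames above, ``twitterscraper user2 --user -o tweets_user2.json`` would also return user2's retweet of user1's tweet.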


2.5 Scraping for User Profile information
-----------------------------------------
By adding the ``--profiles`` argument, twitterscraper will, in addition to the tweets, also scrape for the profile information of the users who have written these tweets.
The results will be saved in the file ``userprofiles_<filename>``.

Try not to use this argument too much. If you have already scraped profile information for a set of users, there is no need to do it again :)
It is also possible to scrape for profile information without scraping for tweets.
Examples of this can be found in the examples folder.
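
The same can be sketched from within Python. The snippet below is hypothetical:
the helper name ``query_user_info`` and its signature are assumptions, so check
the examples folder for the canonical version::

    # Hypothetical sketch; helper name and signature are assumed.
    from twitterscraper.query import query_user_info

    # Fetch a single user's profile; the returned User object carries the
    # profile fields listed in section 1, including the new is_verified flag.
    profile = query_user_info("realDonaldTrump")
    print(profile.user, profile.followers, profile.is_verified)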


3. Output
6 changes: 6 additions & 0 deletions changelog.txt
@@ -1,5 +1,11 @@
# twitterscraper changelog

# 1.2.0 ( 2019-06-22 )
### Added
- PR #186: adds the fields is_retweet, retweeter-related information, and timestamp_epochs to the output.
- PR #184: use fake_useragent for generating random user-agent headers.
- Additionally scrape for 'is_verified' when scraping user profile pages.

# 1.1.0 ( 2019-06-15 )
### Added
- PR #176: Using billiard library instead of multiprocessing to add the ability to use this library with Celery.
28 changes: 1 addition & 27 deletions requirements.txt
@@ -2,31 +2,5 @@ coala-utils~=0.5.0
bs4
lxml
requests
billiard
fake-useragent
4 changes: 1 addition & 3 deletions setup.py
@@ -1,14 +1,12 @@
#!/usr/bin/env python3

from setuptools import setup, find_packages


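# Load the dependency list from requirements.txt; it feeds install_requires
# further down (collapsed in this diff view).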
with open('requirements.txt') as requirements:
    required = requirements.read().splitlines()

setup(
    name='twitterscraper',
    version='1.2.0',
    description='Tool for scraping Tweets',
    url='https://github.com/taspinar/twitterscraper',
    author=['Ahmet Taspinar', 'Lasse Schuirmann'],
2 changes: 1 addition & 1 deletion twitterscraper/__init__.py
@@ -5,7 +5,7 @@
Twitter Scraper tool
"""

__version__ = '1.2.0'
__author__ = 'Ahmet Taspinar'
__license__ = 'MIT'

7 changes: 6 additions & 1 deletion twitterscraper/user.py
@@ -3,7 +3,7 @@

class User:
    def __init__(self, user="", full_name="", location="", blog="", date_joined="", id="", tweets=0,
                 following=0, followers=0, likes=0, lists=0, is_verified=0):
        self.user = user
        self.full_name = full_name
        self.location = location
@@ -15,6 +15,7 @@ def __init__(self, user="", full_name="", location="", blog="", date_joined="",
        self.followers = followers
        self.likes = likes
        self.lists = lists
        self.is_verified = is_verified

    @classmethod
    def from_soup(self, tag_prof_header, tag_prof_nav):
@@ -47,6 +48,10 @@ def from_soup(self, tag_prof_header, tag_prof_nav):
        else:
            self.date_joined = date_joined.strip()

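        # Verified accounts carry a badge rendered inside the
        # 'ProfileHeaderCard-badges' span of the profile header;
        # its presence alone marks the user as verified.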
        tag_verified = tag_prof_header.find('span', {'class': "ProfileHeaderCard-badges"})
        if tag_verified is not None:
            self.is_verified = 1

        self.id = tag_prof_nav.find('div', {'class': 'ProfileNav'})['data-user-id']
        tweets = tag_prof_nav.find('span', {'class': "ProfileNav-value"})['data-count']
        if tweets is None:
