
Documentation of pygetpapers

Radhu Ladani edited this page May 10, 2021 · 9 revisions

pygetpapers

Summary

  • a Python version of getpapers
  • (py)getpapers issues a search query to a chosen repository via its RESTful API (or by scraping), analyses the hits and systematically downloads the articles without further interaction.

Installation

Ensure that pip is installed along with Python. Download Python from https://www.python.org/downloads/ and select the option "Add Python to PATH" while installing.

If you have difficulty installing pip, see https://pip.pypa.io/en/stable/installing/

Way one (recommended): Ensure that the git CLI is installed and available on the PATH (see https://git-scm.com/), then enter the command:

pip install git+git://github.com/petermr/pygetpapers

Confirm that pygetpapers has been installed by reopening the terminal and typing the command `pygetpapers`

You should see a help message come up.

Way two: Manually clone the repository and run `python setup.py install` from inside the repository directory.

Confirm that pygetpapers has been installed by reopening the terminal and typing the command `pygetpapers`

You should see a help message come up.
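
The installation check can also be scripted. The following is a minimal Python sketch, assuming only that a successful installation places a `pygetpapers` executable on the PATH (which the help-message check implies). The function name `is_on_path` is a hypothetical helper for illustration and is not part of pygetpapers.

```python
import shutil


def is_on_path(command: str) -> bool:
    """Return True if an executable named `command` is found on the PATH."""
    # shutil.which returns the full path of the executable, or None if absent.
    return shutil.which(command) is not None


if __name__ == "__main__":
    # "pygetpapers" is the command name used throughout this page.
    if is_on_path("pygetpapers"):
        print("pygetpapers found on PATH")
    else:
        print("pygetpapers not found; reopen the terminal or check the installation")
```

This only confirms that the command can be located; running `pygetpapers` itself is still the way to see the help message.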

To update pygetpapers to a newer version, git pull the repository and enter the command:

pip install git+git://github.com/petermr/pygetpapers

Usage

  • Type the command pygetpapers to display the help.

Queries are built using the -q flag.

The query format is described at http://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf

A condensed guide can be found at https://github.com/petermr/pygetpapers/wiki/query-format

Sample queries:

  1. The following query downloads 100 full-text XMLs, PDFs and supplementary files, along with the CSV and JSON (default), for the topic "lantana" and saves them in a directory called "test".

    pygetpapers -q "lantana" -k 100 -o "test" --supp -c -p -x

  2. The following query only prints the number of hits for the topic "lantana".

    pygetpapers -n -q "lantana"

  3. The following query creates only the CSV output for the metadata of 100 papers on the topic "lantana" in an output directory called "test".

    pygetpapers --onlyquery -q "lantana" -k 100 -o "test" -c

  4. The following query creates only the HTML output for the metadata of 100 papers on the topic "lantana" in an output directory called "test".

    pygetpapers --onlyquery -q "lantana" -k 100 --makehtml -o "test"

  5. The following nested query downloads 100 full-text XMLs and PDFs, along with the CSV and JSON (default), for the query "(lantana camara) AND (eichhornia crassipes)", which uses the logical AND keyword, and saves them in a directory called "test".

    pygetpapers -q "(lantana camara) AND (eichhornia crassipes)" -k 100 -o "test" -c -p -x

  6. If the user wants to update an existing corpus in the directory "test", which contains eupmc_results.json with 100 papers for the query "lantana" along with their XMLs and PDFs, the following query can be used:

    pygetpapers --update "C:\Users\DELL\test\eupmc_results.json" -q "lantana" -k 10 -x -p

  7. If the corpus in the directory "test" contains eupmc_results.json but originally had only XMLs and the user wants to download the PDFs, or if the query broke partway through and the user wants to restart the download of PDFs and XMLs, the following query can be used:

    pygetpapers --restart "C:\Users\DELL\test\eupmc_results.json" -o "test" -x -p -q "lantana"

  8. If the user wants references, the following query downloads a references.xml file where available. The --references flag requires a source for the references (AGR, CBA, CTX, ETH, HIR, MED, PAT, PMC, PPR).

    pygetpapers -q "lantana" -k 10 -o "test" -c -x --references PMC

  9. If the user wants synonyms included, --synonym returns results that contain synonyms as well.

    pygetpapers --onlyquery -q "lantana" -k 10 -o "test" -c --synonym

  10. If the user wants to save the query, --save_query saves the passed query in a config file.

    pygetpapers -q "lantana" -k 5 -o "test" -c -p -x --save_query

  11. If the user wants papers from a particular date range, --startdate gives papers starting from the given date and --enddate gives papers up to the given date. Both use the format YYYY-MM-DD.

    pygetpapers -q "lantana" -k 10 -o "test" -c -x --startdate 2020-05-01 --enddate 2021-05-01

  12. If the user wants to start a query from a configuration file, --config takes the path of the config file from which pygetpapers reads the query.

    pygetpapers --config "C:\Users\DELL\test\saved_config.ini"

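The sample queries above share a common shape, so they can be assembled programmatically. The following is a minimal Python sketch, not part of pygetpapers itself: `build_command` and `normalise_date` are hypothetical helper names, and the sketch uses only flags that appear in the samples (-q, -k, -o, -c, -p, -x, --startdate, --enddate). It checks that dates follow the YYYY-MM-DD format before they are passed on, and it assumes that a start date later than the end date is a mistake worth rejecting.

```python
import shlex
from datetime import datetime
from typing import List, Optional


def normalise_date(text: str) -> str:
    """Parse a YYYY-MM-DD date and return it zero-padded; raise ValueError if invalid."""
    return datetime.strptime(text, "%Y-%m-%d").date().isoformat()


def build_command(
    query: str,
    limit: int,
    output: str,
    xml: bool = False,
    pdf: bool = False,
    csv: bool = False,
    startdate: Optional[str] = None,
    enddate: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list using only flags shown in the sample queries."""
    args = ["pygetpapers", "-q", query, "-k", str(limit), "-o", output]
    if csv:
        args.append("-c")
    if pdf:
        args.append("-p")
    if xml:
        args.append("-x")
    start = normalise_date(startdate) if startdate else None
    end = normalise_date(enddate) if enddate else None
    # Zero-padded ISO dates sort correctly as plain strings.
    if start and end and start > end:
        raise ValueError("startdate must not be later than enddate")
    if start:
        args += ["--startdate", start]
    if end:
        args += ["--enddate", end]
    return args


if __name__ == "__main__":
    # Mirrors sample query 11.
    command = build_command(
        "lantana", 10, "test", xml=True, csv=True,
        startdate="2020-05-01", enddate="2021-05-01",
    )
    # shlex.join quotes arguments for POSIX shells; quoting rules differ on Windows.
    print(shlex.join(command))
```

Keeping the command as a list of arguments means it can be handed directly to `subprocess.run(command)`, which avoids shell quoting problems with queries such as "(lantana camara) AND (eichhornia crassipes)".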