Comparison of `getpapers` and `pygetpapers`

petermr edited this page Apr 17, 2021 · 3 revisions

getpapers and pygetpapers will continue to be used interchangeably for some time. It's important that:

  • the differences are minor
  • the differences are documented

The searches `covid20` and `ebola1` return only a small number of hits (currently 10 or fewer), and the results are unlikely to change rapidly. The search terms themselves are probably misprints or incorrect names.

getpapers

getpapers -q ebola1 -a -x -o ebola1g
info: Searching using eupmc API
(node:61796) Warning: Accessing non-existent property 'padLevels' of module exports inside circular dependency
(Use `node --trace-warnings ...` to show where the warning was created)
info: Found 10 results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.5 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
warn: Article with pmcid "PMC7822617" was not Open Access (therefore no XML)
warn: Article with doi "10.2139/ssrn.3640311 did not have a PMCID (therefore no XML)
warn: Article with doi "10.1101/2020.03.29.014209 did not have a PMCID (therefore no XML)
warn: Article with doi "10.21203/rs.3.rs-29546/v1 did not have a PMCID (therefore no XML)
info: Got XML URLs for 6 out of 10 results
info: Downloading fulltext XML files
Downloading files [==============================] 100% (6/6) [0.0s elapsed, eta 0.0]
info: All downloads succeeded!

gives:

tree ebola1g/
ebola1g/
├── 10.1101
│   └── 2020.03.29.014209
│       └── eupmc_result.json
├── 10.21203
│   └── rs.3.rs-29546
│       └── v1
│           └── eupmc_result.json
├── 10.2139
│   └── ssrn.3640311
│       └── eupmc_result.json
├── PMC4339239
│   ├── eupmc_result.json
│   └── fulltext.xml
├── PMC4493273
│   ├── eupmc_result.json
│   └── fulltext.xml
├── PMC4587849
│   ├── eupmc_result.json
│   └── fulltext.xml
├── PMC5630589
│   ├── eupmc_result.json
│   └── fulltext.xml
├── PMC7130407
│   ├── eupmc_result.json
│   └── fulltext.xml
├── PMC7436544
│   ├── eupmc_result.json
│   └── fulltext.xml
├── PMC7822617
│   └── eupmc_result.json
├── eupmc_fulltext_html_urls.txt
└── eupmc_results.json

14 directories, 18 files

pygetpapers

(The `-a` option is not recognised by `pygetpapers`, but it should be reinstated for compatibility with `getpapers`.)

pygetpapers -q ebola1 -x -o ebola1p
INFO: Total Hits are 10
INFO: Total Hits are 10
WARNING: Could not find more papers
WARNING: html url not found for paper 1
WARNING: Abstract not found for paper 1
WARNING: Keywords not found for paper 1
WARNING: pdf url not found for paper 1
WARNING: Author list not found for paper 1
WARNING: Abstract not found for paper 2
WARNING: Keywords not found for paper 2
WARNING: Keywords not found for paper 3
WARNING: Abstract not found for paper 4
WARNING: Keywords not found for paper 4
WARNING: Author list not found for paper 4
WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 6
WARNING: Author list not found for paper 6
WARNING: Keywords not found for paper 7
INFO: Saving XML files to /Users/pm286/projects/aroma_game/ebola1p/*/fulltext.xml
INFO: */Wrote xml for PMC7822617/
INFO: */Wrote xml for PMC7436544/
INFO: */Wrote xml for PMC4587849/
INFO: */Wrote xml for PMC7130407/
INFO: */Wrote xml for PMC5630589/
INFO: */Wrote xml for PMC4339239/
INFO: */Wrote xml for PMC4493273/

giving

$ tree ebola1p
ebola1p
├── PMC4339239
│   ├── eupmc_result.json
│   └── fulltext.xml
├── PMC4493273
│   ├── eupmc_result.json
│   └── fulltext.xml
├── PMC4587849
│   ├── eupmc_result.json
│   └── fulltext.xml
├── PMC5630589
│   ├── eupmc_result.json
│   └── fulltext.xml
├── PMC7130407
│   ├── eupmc_result.json
│   └── fulltext.xml
├── PMC7436544
│   ├── eupmc_result.json
│   └── fulltext.xml
├── PMC7822617
│   ├── eupmc_result.json
│   └── fulltext.xml
└── eupmc_results.json
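
The two output trees can also be compared programmatically rather than by eye. The following is a minimal sketch, not part of either tool: the function names `relative_files` and `compare_outputs` are hypothetical, and it assumes both output directories (here `ebola1g` and `ebola1p`) exist locally.

```python
from pathlib import Path
from typing import Dict, Set


def relative_files(output_dir: str) -> Set[str]:
    """Collect every file under output_dir as a POSIX-style relative path."""
    root = Path(output_dir)
    return {
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    }


def compare_outputs(dir_a: str, dir_b: str) -> Dict[str, Set[str]]:
    """Report which files appear in only one output directory or in both."""
    files_a = relative_files(dir_a)
    files_b = relative_files(dir_b)
    return {
        "only_in_a": files_a - files_b,
        "only_in_b": files_b - files_a,
        "in_both": files_a & files_b,
    }


if __name__ == "__main__":
    # Directory names are taken from the commands shown on this page.
    report = compare_outputs("ebola1g", "ebola1p")
    for label, files in report.items():
        print(label)
        for name in sorted(files):
            print("   ", name)
```

For the listings shown above, this should report the three DOI-based `eupmc_result.json` records and `eupmc_fulltext_html_urls.txt` as present only in `ebola1g`, and `PMC7822617/fulltext.xml` as present only in `ebola1p`.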

# Differences
* `getpapers` downloads the `-a` metadata. `pygetpapers` should include this too.
* `getpapers` includes all articles, not just those with `PMC...` identifiers. `pygetpapers` should do the same.
* The INFO and WARNING messages differ between the two tools.
* `pygetpapers` should not refer to "paper 1", etc. in its messages. It should use the article IDs.
* `getpapers` writes `eupmc_fulltext_html_urls.txt`. This is useful.
* `getpapers` mangles identifiers that contain `/`, turning "abc/def/*.json" into a false directory structure. We probably can't change this in `getpapers`. In any case `pygetpapers` should avoid it and use (say) "abc_def/*.json".
* `getpapers` has not downloaded `PMC7822617/fulltext.xml`, because the article is not Open Access. `pygetpapers` has created a zero-byte file instead. This will confuse people, and the file should be omitted.
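
Two of the points above (flattening identifiers that contain `/`, and spotting zero-byte `fulltext.xml` files) can be illustrated with a short sketch. This is not the `pygetpapers` implementation: `safe_dirname` and `empty_fulltext_files` are hypothetical helper names, and the `_` separator is only the suggestion made in the list above.

```python
from pathlib import Path
from typing import List


def safe_dirname(identifier: str) -> str:
    """Flatten an identifier such as a DOI into a single directory name."""
    # Replace path separators so that "10.2139/ssrn.3640311" cannot be
    # interpreted as nested directories.
    return identifier.replace("/", "_").replace("\\", "_")


def empty_fulltext_files(output_dir: str) -> List[Path]:
    """Return the fulltext.xml files directly under each article directory
    of output_dir that contain zero bytes."""
    root = Path(output_dir)
    return sorted(
        path
        for path in root.glob("*/fulltext.xml")
        if path.is_file() and path.stat().st_size == 0
    )


if __name__ == "__main__":
    print(safe_dirname("10.2139/ssrn.3640311"))
    # List (but do not delete) any empty full-text files in the
    # pygetpapers output directory used on this page.
    for path in empty_fulltext_files("ebola1p"):
        print("empty:", path)
```

The sketch only reports empty files. Whether to delete them afterwards or to avoid writing them in the first place is a design decision for `pygetpapers`.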