Comparison of `getpapers` and `pygetpapers`
petermr edited this page Apr 17, 2021 · 3 revisions
`getpapers` and `pygetpapers` will continue to be used interchangeably for some time. It's important that:
- the differences are minor
- the differences are documented
The searches `covid20` and `ebola1` give small numbers of hits (currently 10 or fewer) and are unlikely to change very rapidly. The search terms are probably misprints or incorrect names.
getpapers -q ebola1 -a -x -o ebola1g
info: Searching using eupmc API
(node:61796) Warning: Accessing non-existent property 'padLevels' of module exports inside circular dependency
(Use `node --trace-warnings ...` to show where the warning was created)
info: Found 10 results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.5 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
warn: Article with pmcid "PMC7822617" was not Open Access (therefore no XML)
warn: Article with doi "10.2139/ssrn.3640311 did not have a PMCID (therefore no XML)
warn: Article with doi "10.1101/2020.03.29.014209 did not have a PMCID (therefore no XML)
warn: Article with doi "10.21203/rs.3.rs-29546/v1 did not have a PMCID (therefore no XML)
info: Got XML URLs for 6 out of 10 results
info: Downloading fulltext XML files
Downloading files [==============================] 100% (6/6) [0.0s elapsed, eta 0.0]
info: All downloads succeeded!
gives:
tree ebola1g/
ebola1g/
├── 10.1101
│ └── 2020.03.29.014209
│ └── eupmc_result.json
├── 10.21203
│ └── rs.3.rs-29546
│ └── v1
│ └── eupmc_result.json
├── 10.2139
│ └── ssrn.3640311
│ └── eupmc_result.json
├── PMC4339239
│ ├── eupmc_result.json
│ └── fulltext.xml
├── PMC4493273
│ ├── eupmc_result.json
│ └── fulltext.xml
├── PMC4587849
│ ├── eupmc_result.json
│ └── fulltext.xml
├── PMC5630589
│ ├── eupmc_result.json
│ └── fulltext.xml
├── PMC7130407
│ ├── eupmc_result.json
│ └── fulltext.xml
├── PMC7436544
│ ├── eupmc_result.json
│ └── fulltext.xml
├── PMC7822617
│ └── eupmc_result.json
├── eupmc_fulltext_html_urls.txt
└── eupmc_results.json
14 directories, 18 files
(The `-a` option is not recognised by `pygetpapers` but should be reinstated for compatibility.)
pygetpapers -q ebola1 -x -o ebola1p
INFO: Total Hits are 10
INFO: Total Hits are 10
WARNING: Could not find more papers
WARNING: html url not found for paper 1
WARNING: Abstract not found for paper 1
WARNING: Keywords not found for paper 1
WARNING: pdf url not found for paper 1
WARNING: Author list not found for paper 1
WARNING: Abstract not found for paper 2
WARNING: Keywords not found for paper 2
WARNING: Keywords not found for paper 3
WARNING: Abstract not found for paper 4
WARNING: Keywords not found for paper 4
WARNING: Author list not found for paper 4
WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 6
WARNING: Author list not found for paper 6
WARNING: Keywords not found for paper 7
INFO: Saving XML files to /Users/pm286/projects/aroma_game/ebola1p/*/fulltext.xml
INFO: */Wrote xml for PMC7822617/
INFO: */Wrote xml for PMC7436544/
INFO: */Wrote xml for PMC4587849/
INFO: */Wrote xml for PMC7130407/
INFO: */Wrote xml for PMC5630589/
INFO: */Wrote xml for PMC4339239/
INFO: */Wrote xml for PMC4493273/
gives:
$ tree ebola1p
ebola1p
├── PMC4339239
│ ├── eupmc_result.json
│ └── fulltext.xml
├── PMC4493273
│ ├── eupmc_result.json
│ └── fulltext.xml
├── PMC4587849
│ ├── eupmc_result.json
│ └── fulltext.xml
├── PMC5630589
│ ├── eupmc_result.json
│ └── fulltext.xml
├── PMC7130407
│ ├── eupmc_result.json
│ └── fulltext.xml
├── PMC7436544
│ ├── eupmc_result.json
│ └── fulltext.xml
├── PMC7822617
│ ├── eupmc_result.json
│ └── fulltext.xml
└── eupmc_results.json
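
The two listings above can also be compared programmatically rather than by eye. The following is a minimal sketch, not part of either tool: the helper names `relative_files` and `compare_outputs` are hypothetical, and it assumes both output directories (here `ebola1g` and `ebola1p`) exist locally.

```python
from pathlib import Path


def relative_files(root):
    """Return the set of file paths under root, relative to root, as POSIX strings."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def compare_outputs(dir_a, dir_b):
    """Return (only_in_a, only_in_b): sorted relative paths present in just one tree."""
    files_a = relative_files(dir_a)
    files_b = relative_files(dir_b)
    return sorted(files_a - files_b), sorted(files_b - files_a)


# Example usage (hypothetical; assumes the directories from the runs above):
# only_g, only_p = compare_outputs("ebola1g", "ebola1p")
# print("only in getpapers output:", only_g)
# print("only in pygetpapers output:", only_p)
```

This only compares file names and locations. It does not compare file contents, so differences inside `eupmc_result.json` would need a separate check.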
# Differences
* `getpapers` downloads the `-a` metadata. This should be included in `pygetpapers`.
* `getpapers` includes all articles, not just those with `PMC...` IDs. This should be included in `pygetpapers`.
* The INFO and WARNING messages differ.
* `pygetpapers` should not refer to "paper 1", etc. in its messages. It should use the article IDs.
* `getpapers` writes `eupmc_fulltext_html_urls.txt`. This is useful.
* `getpapers` mangles identifiers containing "/" (such as DOIs), so that "abc/def/*.json" becomes a false directory structure. We probably can't change this. In any case `pygetpapers` should avoid this and use (say) "abc_def/*.json".
* `getpapers` has not downloaded `PMC7822617/fulltext.xml` (the log reports that the article was not Open Access). `pygetpapers` has created a ZERO BYTES file. This will confuse people, and the file should be omitted.
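
Two of the points above (flattening identifiers that contain "/" and spotting zero-byte `fulltext.xml` files) can be illustrated with a short sketch. This is an assumption about how it might be done, not the `pygetpapers` implementation: the function names `flat_dirname` and `empty_fulltext_files` are hypothetical, and the underscore replacement simply follows the "abc_def" suggestion.

```python
from pathlib import Path


def flat_dirname(identifier: str) -> str:
    """Map an identifier (e.g. a DOI) to a single directory name.

    Path separators are replaced with underscores so that
    "abc/def" becomes "abc_def" instead of nested directories.
    """
    return identifier.replace("/", "_").replace("\\", "_")


def empty_fulltext_files(root):
    """Return a sorted list of zero-byte fulltext.xml files under root."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("fulltext.xml")
        if p.is_file() and p.stat().st_size == 0
    )


# A DOI from the getpapers run above, flattened into one directory name.
print(flat_dirname("10.21203/rs.3.rs-29546/v1"))  # 10.21203_rs.3.rs-29546_v1
```

Running `empty_fulltext_files("ebola1p")` on the output directory would list any empty full-text files, which could then be deleted or, preferably, never written.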