Boolean queries
Simple queries are often too broad (e.g. "Climate change" => 114275 hits) or too narrow ("TPS69" => 2). The normal approach is to add Boolean constraints (AND, OR, NOT). Some researchers, especially when doing systematic reviews, create complex nested combinations of Boolean clauses (I've seen over a page of dense text). Most repositories support these, but it's easy to make mistakes when creating them.
To support the query-er, some sites provide selectable clauses (e.g. EPMC's advanced query builder https://europepmc.org/advancesearch). Here's a typical query (try it yourself):
(ABSTRACT:"blood") AND (IN_EPMC:y) AND (JOURNAL:"BMJ" OR JOURNAL:"Br Med J" OR JOURNAL:"Br Med J Clin Res Ed"
OR JOURNAL:"Lond J Med" OR JOURNAL:"Prov Med Surg J" OR JOURNAL:"Prov Med J Retrosp Med Sci"
OR JOURNAL:"Prov Med Surg J (1840)" OR JOURNAL:"Assoc Med J") AND (PUB_TYPE:"Journal Article"
OR PUB_TYPE:"article-commentary" OR PUB_TYPE:"research-article" OR PUB_TYPE:"protocol"
OR PUB_TYPE:"rapid-communication" OR PUB_TYPE:"product-review")
But even this is limited; there is no NOT function in the query builder (as far as I know).
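A fielded query like the one above can also be assembled programmatically, which makes it easy to append a NOT clause by hand. The following is a minimal sketch, not part of pygetpapers or EPMC: the helper names `field_terms`, `group` and `fielded_query` are hypothetical, and it assumes the search syntax accepts NOT in the query text, as the NOT queries further down this page suggest.

```python
def field_terms(field, values):
    """Return ['FIELD:"v1"', 'FIELD:"v2"', ...] for one search field."""
    return [f'{field}:"{value}"' for value in values]


def group(terms, operator="OR"):
    """Join terms with one operator and wrap the result in parentheses."""
    return "(" + f" {operator} ".join(terms) + ")"


def fielded_query(include, exclude=None):
    """AND one OR-group per included field; optionally NOT a group of excluded terms.

    `include` and `exclude` map a field name to a list of values.
    This is an illustrative helper, not an EPMC or pygetpapers API.
    """
    clauses = [group(field_terms(field, values)) for field, values in include.items()]
    query = " AND ".join(clauses)
    if exclude:
        unwanted = [
            term
            for field, values in exclude.items()
            for term in field_terms(field, values)
        ]
        # Parenthesise the positive part so NOT applies to the whole of it.
        query = f"({query}) NOT {group(unwanted)}"
    return query


query = fielded_query(
    include={"ABSTRACT": ["blood"], "JOURNAL": ["BMJ", "Br Med J"]},
    exclude={"PUB_TYPE": ["product-review"]},
)
print(query)
```

The resulting string could then be passed to `pygetpapers -q`, although whether EPMC treats the NOT clause exactly as intended should be checked against the hit counts.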
Here are some examples. We are trying to find "terpene synthases" (a type of gene whose expressed protein enzymes make organic molecules, often found in essential oils).
pygetpapers -q 'terpene synthase' -n
INFO: Final query is terpene synthase
INFO: Total number of hits for the query are 4274
There are many synthases, usually named TPS1, TPS2, ...
pygetpapers -q 'TPS1' -n
INFO: Final query is TPS1
INFO: Total number of hits for the query are 1079
We may want to get all the variants:
pygetpapers -q 'TPS1 OR TPS2' -n
INFO: Final query is TPS1 OR TPS2
INFO: Total number of hits for the query are 1283
But there are many variants, and creating queries with all of them is slightly error-prone. Moreover, there are other meanings of "TPS":
pygetpapers -q '(TPS1 OR TPS2) NOT "thermoplastic starch "' -n
INFO: Final query is TPS1 OR TPS2 NOT "thermoplastic starch "
INFO: Total number of hits for the query are 1281
This has cleaned up the hit list slightly (1283 => 1281).
A quick partial solution to this problem is to make a list of what we want and a list of what we don't (stopwords). There are several strategies:
- query AND (A OR B OR C...): broad query, made narrower with a list (currently implemented, not ideal)
- A OR B OR C...: OR'ing of precise queries in a list (desirable)
- query NOT (D OR E...): broad query, narrowed by a stopword list (desirable)
- (A OR B OR C...) NOT (D OR E...): precise list, narrowed by a stopword list (desirable)
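The four strategies differ only in which of the three ingredients (query, term list, stopword list) are present, so a single function can cover them all. The sketch below is an illustration under assumptions: `or_group` and `build_query` are hypothetical names, not pygetpapers functions, and multi-word terms are assumed to be quoted by the caller.

```python
def or_group(terms):
    """Join terms with OR inside parentheses, e.g. (A OR B OR C)."""
    return "(" + " OR ".join(terms) + ")"


def build_query(query=None, terms=None, stopwords=None):
    """Combine an optional query, an optional term list and an optional stopword list.

    Hypothetical helper: covers 'query AND (A OR B)', '(A OR B)',
    'query NOT (D OR E)' and '(A OR B) NOT (D OR E)'.
    """
    positive = []
    if query:
        positive.append(query)
    if terms:
        positive.append(or_group(terms))
    if not positive:
        raise ValueError("need a query, a term list, or both")
    result = " AND ".join(positive)
    if stopwords:
        if len(positive) > 1:
            # Parenthesise so NOT applies to the whole positive part.
            result = f"({result})"
        result = f"{result} NOT {or_group(stopwords)}"
    return result


print(build_query('"essential oil"', ["TPS1", "TPS2"]))
print(build_query(terms=["TPS1", "TPS2"]))
print(build_query("TPS1", stopwords=['"thermoplastic starch"']))
print(build_query(terms=["TPS1", "TPS2"], stopwords=['"thermoplastic starch"']))
```

Raising an error when neither a query nor a term list is given is a design choice of this sketch: a query consisting only of NOT clauses has nothing to narrow.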
pygetpapers uses a list of terms in a file (--terms FILE). Current usage is:
-q QUERY --terms TERMFILE
The termfile is a comma-separated list, e.g. tps_terms_2.txt contains:
TPS1,TPS2
Find all papers on essential oils with either TPS1 or TPS2:
pygetpapers -q '"essential oil"' --terms tps_terms_2.txt -n
INFO: Final query is ("essential oil" AND (TPS1 OR TPS2
))
INFO: Total number of hits for the query are 21
Ignoring the formatting, this is the query we want. We can use larger lists:
pygetpapers -q 'methods' --terms tps_terms_50.txt -n
INFO: Final query is (methods AND (TPS1 OR TPS2 OR TPS3 OR TPS4 OR TPS5 OR TPS6 OR TPS7 OR TPS8 OR TPS9 OR TPS10 OR
TPS11 OR TPS12 OR TPS13 OR TPS14 OR TPS15 OR TPS16 OR TPS17 OR TPS18 OR TPS19 OR TPS20 OR
TPS21 OR TPS22 OR TPS23 OR TPS24 OR TPS25 OR TPS26 OR TPS27 OR TPS28 OR TPS29 OR TPS30 OR
TPS31 OR TPS32 OR TPS33 OR TPS34 OR TPS35 OR TPS36 OR TPS37 OR TPS38 OR TPS39 OR TPS40 OR
TPS41 OR TPS42 OR TPS43 OR TPS44 OR TPS45 OR TPS46 OR TPS47 OR TPS48 OR TPS49 OR TPS50
))
INFO: Total number of hits for the query are 1442
We have concatenated 50 terms successfully.
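Typing fifty numbered names by hand is exactly the kind of task where slips creep in, so a termfile like this can be generated instead. The following is a small sketch under assumptions: `numbered_terms` and `write_termfile` are hypothetical helpers, and the only thing taken from the text above is that a termfile is a comma-separated list.

```python
from pathlib import Path


def numbered_terms(prefix, first, last):
    """Return [prefix+first, ..., prefix+last], e.g. TPS1 ... TPS50."""
    return [f"{prefix}{number}" for number in range(first, last + 1)]


def write_termfile(path, terms):
    """Write the terms as one comma-separated line with no trailing comma."""
    Path(path).write_text(",".join(terms), encoding="utf-8")


terms = numbered_terms("TPS", 1, 50)
write_termfile("tps_terms_50.txt", terms)
print(len(terms), terms[0], terms[-1])
```

Writing the file without a trailing comma or trailing newline is a choice of this sketch, made to keep stray empty entries out of the list.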
- EPMC does not seem to detect some malformed queries:
pygetpapers -q '(TPS1 OR TPS2)' -n
INFO: Final query is (TPS1 OR TPS2)
INFO: Total number of hits for the query are 1283
pygetpapers -q '((TPS1 OR TPS2)' -n
INFO: Final query is ((TPS1 OR TPS2)
INFO: Total number of hits for the query are 392
I have checked on the EPMC site and this is a real effect - a malformed query gives the wrong result. Therefore it's critical to make sure queries are syntactically correct.
pygetpapers -q '(TPS1 OR TPS2 OR)' -n
INFO: Final query is (TPS1 OR TPS2 OR)
INFO: Total number of hits for the query are 392
The hanging OR gives a similar problem.
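Since the server does not reject these queries, one option is to check them locally before sending. The sketch below looks only for the two faults shown above (unbalanced parentheses and hanging operators) plus unbalanced double quotes. It is an assumption-laden illustration, not a full parser of the EPMC syntax: `query_problems` is a hypothetical name, and operators are assumed to be written in upper case as in the examples.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}


def query_problems(query):
    """Return a list of human-readable syntax problems (empty if none found)."""
    problems = []

    # 1. Parentheses and double quotes must balance.
    depth = 0
    in_quotes = False
    for char in query:
        if char == '"':
            in_quotes = not in_quotes
        elif not in_quotes and char == "(":
            depth += 1
        elif not in_quotes and char == ")":
            depth -= 1
            if depth < 0:
                problems.append("closing parenthesis without an opening one")
                depth = 0
    if depth > 0:
        problems.append(f"unclosed opening parentheses: {depth}")
    if in_quotes:
        problems.append("unbalanced double quote")

    # 2. Every operator needs an operand on the side(s) it binds to.
    unquoted = re.sub(r'"[^"]*"', '""', query)  # hide quoted phrases
    tokens = re.findall(r"\(|\)|[^\s()]+", unquoted)
    for i, token in enumerate(tokens):
        if token not in OPERATORS:
            continue
        before = tokens[i - 1] if i > 0 else None
        after = tokens[i + 1] if i + 1 < len(tokens) else None
        if after is None or after == ")" or after in {"AND", "OR"}:
            problems.append(f"{token} has no right-hand operand")
        if token != "NOT" and (before is None or before == "(" or before in OPERATORS):
            problems.append(f"{token} has no left-hand operand")
    return problems


for candidate in ["(TPS1 OR TPS2)", "((TPS1 OR TPS2)", "(TPS1 OR TPS2 OR)"]:
    print(candidate, "->", query_problems(candidate) or "looks well-formed")
```

A check like this only catches structural slips; it cannot tell whether a well-formed query means what the author intended.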
--terms is a great development but needs slight adjustment:
- chained ORs often do not need a query, so --terms should create a complete query
- spurious commas will create an incorrect query
- multiple files would be valuable: --terms file1.txt file2.txt would OR them
- how to include AND and NOT logic?
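Two of these points, spurious commas and multiple files, can be sketched outside pygetpapers. The code below shows one possible behaviour, not how pygetpapers works: `read_terms` and `or_terms_from_files` are hypothetical helpers that drop empty entries, strip whitespace, and OR the terms from all files into one complete query.

```python
import tempfile
from pathlib import Path


def read_terms(path):
    """Read a comma-separated termfile, dropping empty entries and whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [term.strip() for term in text.split(",") if term.strip()]


def or_terms_from_files(paths):
    """OR together the terms from several termfiles, keeping first-seen order."""
    seen = []
    for path in paths:
        for term in read_terms(path):
            if term not in seen:  # skip duplicates across files
                seen.append(term)
    if not seen:
        raise ValueError("no terms found in the given files")
    return "(" + " OR ".join(seen) + ")"


# Demonstration with two throwaway files containing stray commas and spaces.
with tempfile.TemporaryDirectory() as folder:
    first = Path(folder, "file1.txt")
    second = Path(folder, "file2.txt")
    first.write_text("TPS1,,TPS2,\n", encoding="utf-8")
    second.write_text(" TPS2 , TPS3\n", encoding="utf-8")
    print(or_terms_from_files([first, second]))
```

How AND and NOT logic should be expressed across files is left open here, as it is in the list above.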