Boolean queries

petermr edited this page Jul 26, 2021 · 5 revisions

Simple queries are often too broad (e.g. "Climate change" => 114275 hits) or too narrow ("TPS69" => 2). The normal approach is to add Boolean constraints (AND, OR, NOT). Some researchers, especially when doing systematic reviews, create complex nested combinations of Boolean clauses (I've seen over a page of dense text). Most repositories support these, but it's easy to make mistakes when creating them.

query building

To help the person writing the query, some sites provide selectable clauses (e.g. EPMC's advanced query builder, https://europepmc.org/advancesearch). Here's a typical query (try it yourself):

(ABSTRACT:"blood") AND (IN_EPMC:y) AND (JOURNAL:"BMJ" OR JOURNAL:"Br Med J" OR JOURNAL:"Br Med J Clin Res Ed" 
OR JOURNAL:"Lond J Med" OR JOURNAL:"Prov Med Surg J" OR JOURNAL:"Prov Med J Retrosp Med Sci" 
OR JOURNAL:"Prov Med Surg J (1840)" OR JOURNAL:"Assoc Med J") AND (PUB_TYPE:"Journal Article" 
OR PUB_TYPE:"article-commentary" OR PUB_TYPE:"research-article" OR PUB_TYPE:"protocol" 
OR PUB_TYPE:"rapid-communication" OR PUB_TYPE:"product-review")

But even this is limited; as far as I know, there is no NOT function in the query builder.
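
Fielded queries like the one above can also be assembled programmatically, which keeps the parentheses and quotes balanced. The following is a minimal sketch, not part of pygetpapers or EPMC; the helper names `any_of` and `all_of` are hypothetical, and only a few of the journal and publication-type values are used to keep it short.

```python
def any_of(field, values):
    """OR together FIELD:"value" clauses and wrap them in parentheses."""
    clauses = [f'{field}:"{value}"' for value in values]
    return "(" + " OR ".join(clauses) + ")"


def all_of(*groups):
    """AND together groups that are already parenthesised."""
    return " AND ".join(groups)


journals = ["BMJ", "Br Med J", "Br Med J Clin Res Ed"]
pub_types = ["Journal Article", "research-article"]

query = all_of(
    any_of("ABSTRACT", ["blood"]),
    "(IN_EPMC:y)",
    any_of("JOURNAL", journals),
    any_of("PUB_TYPE", pub_types),
)
print(query)
```

Because every group is wrapped by the same helper, adding or removing a journal cannot leave a parenthesis unclosed.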

Manually built queries

Here are some examples. We are trying to find "terpene synthases" (a type of gene whose expressed protein enzymes make organic molecules, often found in essential oils).

pygetpapers -q 'terpene synthase' -n
INFO: Final query is terpene synthase
INFO: Total number of hits for the query are 4274

There are many synthases, usually named TPS1, TPS2 ...

pygetpapers -q 'TPS1' -n
INFO: Final query is TPS1
INFO: Total number of hits for the query are 1079

We may want to get all the variants:

pygetpapers -q 'TPS1 OR TPS2' -n
INFO: Final query is TPS1 OR TPS2
INFO: Total number of hits for the query are 1283

But there are many variants, and writing queries that list all of them by hand is error-prone. Moreover, "TPS" has other meanings, which we can try to exclude:

pygetpapers -q '(TPS1 OR TPS2) NOT "thermoplastic starch "' -n
INFO: Final query is TPS1 OR TPS2 NOT "thermoplastic starch "
INFO: Total number of hits for the query are 1281

This has cleaned up the hit list slightly (1283 => 1281)
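
Comparing hit counts by hand gets tedious once many query variants are tried. A rough sketch of how the counts could be collected from the log lines shown above: `parse_hits` and `count_hits` are hypothetical helpers, and it is an assumption that the INFO lines may arrive on either stdout or stderr, so both are searched.

```python
import re
import subprocess

HITS_PATTERN = re.compile(r"Total number of hits for the query are (\d+)")


def parse_hits(log_text):
    """Return the hit count reported in pygetpapers log text, or None."""
    match = HITS_PATTERN.search(log_text)
    return int(match.group(1)) if match else None


def count_hits(query):
    """Run `pygetpapers -q QUERY -n` and return the reported hit count.

    Assumption: the log may be written to stdout or stderr, so both are searched.
    """
    result = subprocess.run(
        ["pygetpapers", "-q", query, "-n"],
        capture_output=True, text=True, check=False,
    )
    return parse_hits(result.stdout + result.stderr)


# The parser can be tried on a log copied from this page without running anything.
sample_log = """INFO: Final query is TPS1 OR TPS2
INFO: Total number of hits for the query are 1283
"""
print(parse_hits(sample_log))  # 1283
```

Passing the query as a list element to `subprocess.run` avoids a second layer of shell quoting around phrases such as "thermoplastic starch".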

pygetpapers term-lists

A quick partial solution to this problem is to make lists of what we want and what we don't want (stopwords). There are several strategies:

  1. query AND (A OR B OR C ...): a broad query, narrowed with a list of wanted terms (currently implemented, not ideal)
  2. A OR B OR C ...: OR'ing the precise queries in a list (desirable)
  3. query NOT (D OR E ...): a broad query, narrowed by a stopword list (desirable)
  4. (A OR B OR C ...) NOT (D OR E ...): wanted terms minus stopwords (desirable)
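
The four strategies can be written down as small string-building functions. This is only a sketch of the intended query shapes, not how pygetpapers works; all function names are hypothetical, and how a given repository applies NOT precedence should be checked before relying on strategies 3 and 4.

```python
def quote(term):
    """Trim a term and quote it if it contains spaces (a phrase)."""
    term = term.strip()
    return f'"{term}"' if " " in term and not term.startswith('"') else term


def or_group(terms):
    """Build (A OR B OR C), skipping empty entries."""
    cleaned = [quote(t) for t in terms if t.strip()]
    return "(" + " OR ".join(cleaned) + ")"


def narrow_with_list(query, wanted):        # strategy 1
    return f"({query} AND {or_group(wanted)})"


def any_term(wanted):                       # strategy 2
    return or_group(wanted)


def exclude_stopwords(query, unwanted):     # strategy 3
    return f"({query}) NOT {or_group(unwanted)}"


def wanted_not_unwanted(wanted, unwanted):  # strategy 4
    return f"{or_group(wanted)} NOT {or_group(unwanted)}"


wanted = ["TPS1", "TPS2"]
unwanted = ["thermoplastic starch "]  # trailing space is trimmed by quote()

print(narrow_with_list('"essential oil"', wanted))
print(any_term(wanted))
print(exclude_stopwords("terpene synthase", unwanted))
print(wanted_not_unwanted(wanted, unwanted))
```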

pygetpapers uses a list of terms in a file (--terms FILE). Current usage is:

-q QUERY --terms TERMFILE 

The termfile is a comma-separated list, e.g. tps_terms_2.txt contains:

TPS1,TPS2

Example

Find all papers on essential oils with either TPS1 or TPS2

pygetpapers -q '"essential oil"' --terms tps_terms_2.txt -n
INFO: Final query is ("essential oil" AND (TPS1 OR TPS2
))
INFO: Total number of hits for the query are 21
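
The line break before the closing parentheses in this output is consistent with the term file ending in a newline that survives a plain split on commas. That is an assumption about the cause, not something checked against the pygetpapers source; the functions below are hypothetical and only illustrate the effect and one way to avoid it.

```python
def naive_terms(text):
    """Split on commas only; a trailing newline stays attached to the last term."""
    return text.split(",")


def clean_terms(text):
    """Split on commas, trim whitespace, and drop empty entries."""
    return [t.strip() for t in text.split(",") if t.strip()]


def compose(query, terms):
    """Build (query AND (A OR B ...)), the shape reported in the log."""
    return f"({query} AND ({' OR '.join(terms)}))"


file_text = "TPS1,TPS2\n"  # tps_terms_2.txt, assuming it ends with a newline

print(repr(compose('"essential oil"', naive_terms(file_text))))
# '("essential oil" AND (TPS1 OR TPS2\n))'
print(compose('"essential oil"', clean_terms(file_text)))
# ("essential oil" AND (TPS1 OR TPS2))
```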

Ignoring the formatting (the stray line break before the closing parentheses), this is the query we want. We can use larger lists:

pygetpapers -q 'methods' --terms tps_terms_50.txt -n
INFO: Final query is (methods AND (TPS1 OR TPS2 OR TPS3 OR TPS4 OR TPS5 OR TPS6 OR TPS7 OR TPS8 OR TPS9 OR TPS10 OR 
TPS11 OR TPS12 OR TPS13 OR TPS14 OR TPS15 OR TPS16 OR TPS17 OR TPS18 OR TPS19 OR TPS20 OR 
TPS21 OR TPS22 OR TPS23 OR TPS24 OR TPS25 OR TPS26 OR TPS27 OR TPS28 OR TPS29 OR TPS30 OR 
TPS31 OR TPS32 OR TPS33 OR TPS34 OR TPS35 OR TPS36 OR TPS37 OR TPS38 OR TPS39 OR TPS40 OR 
TPS41 OR TPS42 OR TPS43 OR TPS44 OR TPS45 OR TPS46 OR TPS47 OR TPS48 OR TPS49 OR TPS50
))
INFO: Total number of hits for the query are 1442

We have concatenated 50 terms successfully.
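
A numbered term file such as tps_terms_50.txt does not have to be typed by hand. A small sketch, with `write_termfile` as a hypothetical helper; it writes to a temporary directory here so that nothing on disk is overwritten.

```python
import tempfile
from pathlib import Path


def write_termfile(path, prefix, count):
    """Write PREFIX1,PREFIX2,...,PREFIXcount as one comma-separated line."""
    terms = [f"{prefix}{i}" for i in range(1, count + 1)]
    # No trailing newline or comma, so a simple comma split yields clean terms.
    Path(path).write_text(",".join(terms), encoding="utf-8")
    return terms


with tempfile.TemporaryDirectory() as tmp:
    termfile = Path(tmp) / "tps_terms_50.txt"
    terms = write_termfile(termfile, "TPS", 50)
    content = termfile.read_text(encoding="utf-8")

print(len(terms))    # 50
print(content[:14])  # TPS1,TPS2,TPS3
```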

problems

  • EPMC does not seem to detect some malformed queries:
pygetpapers -q '(TPS1 OR TPS2)' -n
INFO: Final query is (TPS1 OR TPS2)
INFO: Total number of hits for the query are 1283
pygetpapers -q '((TPS1 OR TPS2)' -n
INFO: Final query is ((TPS1 OR TPS2)
INFO: Total number of hits for the query are 392

I have checked on the EPMC site and this is a real effect - a malformed query gives the wrong result. Therefore it's critical to make sure queries are syntactically correct.

pygetpapers -q '(TPS1 OR TPS2 OR)' -n
INFO: Final query is (TPS1 OR TPS2 OR)
INFO: Total number of hits for the query are 392

The hanging OR gives a similar problem: the query is accepted, but the hit count drops to 392.
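
Since the repository does not reject these queries, they can be checked locally before they are sent. The sketch below is a hypothetical `check_query` helper, not an EPMC or pygetpapers feature; it only looks for unbalanced parentheses, unclosed quotes, and AND/OR/NOT operators with a missing operand, and it assumes operators are written in upper case.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}


def check_query(query):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    depth = 0
    in_quotes = False
    for ch in query:
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    problems.append("unmatched ')'")
                    depth = 0
    if depth > 0:
        problems.append(f"{depth} unclosed '('")
    if in_quotes:
        problems.append("unclosed double quote")

    # Replace quoted phrases with a placeholder so their words are not
    # mistaken for operators, then split into parentheses and words.
    unquoted = re.sub(r'"[^"]*"', " PHRASE ", query)
    tokens = re.findall(r"[()]|[^\s()]+", unquoted)
    for i, tok in enumerate(tokens):
        if tok not in OPERATORS:
            continue
        prev = tokens[i - 1] if i > 0 else None
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # AND/OR need a left operand; NOT may start a clause.
        if tok != "NOT" and (prev is None or prev == "(" or prev in OPERATORS):
            problems.append(f"'{tok}' has no left operand")
        if nxt is None or nxt == ")" or nxt in ("AND", "OR"):
            problems.append(f"'{tok}' has no right operand")
    return problems


for q in ["(TPS1 OR TPS2)", "((TPS1 OR TPS2)", "(TPS1 OR TPS2 OR)"]:
    print(q, "=>", check_query(q) or "no problems detected")
```

Both malformed queries from this section are flagged, while the well-formed one passes.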

conclusion

--terms is a great development but needs slight adjustment:

  • chained ORs often do not need a query, so --terms should be able to create a complete query on its own
  • spurious commas (e.g. a trailing or doubled comma) will create an incorrect query
  • multiple files would be valuable: --terms file1.txt file2.txt would OR them
  • how to include AND and NOT logic?
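
One possible shape for these adjustments is sketched below. It is a proposal, not current pygetpapers behaviour: `read_terms`, `or_group`, and `build_query` are hypothetical names, the stopword files are an invented convention for NOT logic, and treating line breaks as term separators is an assumption.

```python
import tempfile
from pathlib import Path


def read_terms(path):
    """Read one term file, tolerating spurious commas and line breaks."""
    text = Path(path).read_text(encoding="utf-8")
    pieces = text.replace("\n", ",").split(",")
    return [p.strip() for p in pieces if p.strip()]


def or_group(terms):
    """Build (A OR B ...), quoting multi-word terms as phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(term_files, stop_files=(), query=None):
    """OR all wanted terms, optionally AND with a query and NOT the stopwords."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    wanted = list(dict.fromkeys(t for f in term_files for t in read_terms(f)))
    unwanted = list(dict.fromkeys(t for f in stop_files for t in read_terms(f)))
    if not wanted:
        raise ValueError("no terms found in the term files")
    result = or_group(wanted)
    if query:
        result = f"({query} AND {result})"
    if unwanted:
        result = f"{result} NOT {or_group(unwanted)}"
    return result


with tempfile.TemporaryDirectory() as tmp:
    file1 = Path(tmp) / "file1.txt"
    file2 = Path(tmp) / "file2.txt"
    stops = Path(tmp) / "stopwords.txt"
    file1.write_text("TPS1,,TPS2,\n", encoding="utf-8")  # spurious commas
    file2.write_text("TPS2,TPS3\n", encoding="utf-8")    # overlaps with file1
    stops.write_text("thermoplastic starch\n", encoding="utf-8")
    terms_only = build_query([file1, file2])
    full = build_query([file1, file2], [stops], query='"essential oil"')

print(terms_only)  # (TPS1 OR TPS2 OR TPS3)
print(full)        # ("essential oil" AND (TPS1 OR TPS2 OR TPS3)) NOT ("thermoplastic starch")
```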