-
Notifications
You must be signed in to change notification settings - Fork 4
Searching
The search field takes a search expression as input. A search expression is one of:
- Search phrase
-
?expression
(optional) (expression)
-
expression | expression
(or), e.g.go ( left | right )
To use literal ?
, (
, )
or |
, a backslash (\
) needs to be placed in front
of the character.
A search phrase is a sequence of tokens where a token is one of:
- Unannotated (fuzzy match, e.g.
hello
means fuzzy matchhello
) -
'tok
('
prefix means exact match the token) -
^tok
(^
prefix means prefix match the token) -
tok$
($
suffix means suffix match the token) -
~
(explicit spaces, i.e. contiguous sequence of spaces, tabs, etc)
Tokens that are not separated by spaces, operators, or parentheses
are treated specially, we call these linked tokens. For example,
12
, :
, 30
are linked in 12:30
, but not in 12 : 30
. Linked
tokens have a much stricter search distance by default, e.g. in
12:30
, Docfd will search for :
only up to a few tokens away
from 12
, and so on. This allows user to state intention of
reduced fuzziness.
To link spaces to tokens, one needs to be make use of ~
. For
example, to search for "John Smith" ("John" and "Smith" separated
by some number of spaces), one can use John~Smith
to establish
linkage.
For '
, ^
, $
to be considered annotation markers, there cannot
be space between the marker and token, e.g. ^abc
means "prefix
match abc
", but ^ abc
means "fuzzy match ^
and fuzzy match
abc
".
Annotated linked tokens are also treated specially:
-
^12:30
is equivalent to'12
':
^30
-
'12:30
is equivalent to'12
':
'30
-
12:30$
is equivalent to12$
':
'30
But with even stricter search restriction than the normal linked
tokens, namely the next matching token must follow immediately from
the current match, e.g. ^12:3
will not match 12 : 30
but will
match 12:30
Search is asynchronous, specifically:
- Editing of search field is not blocked by search progress
- Updating/clearing the search field cancels the current search and starts a new search immediately
For a phrase with optional operator, such as ?word0 word1 ...
,
the first word is grouped implicitly,
i.e. it is treated as (?word0) word1 ...
.
Document content and user input in the search field are tokenized/segmented in the same way, based on:
- Contiguous alphanumeric characters
- Individual symbols
- Individual UTF-8 characters
- Spaces
A search phrase is a list of said tokens.
Search procedure is a DFS through the document index, where the search range for a word is fixed to a configured range surrounding the previous word (when applicable).
A token in the index matches a token in the search phrase if they fall into one of the following cases:
- They are a case-insensitive exact match
- They are a case-insensitive substring match (token in search phrase being the substring)
- They are within the configured case-insensitive edit distance threshold
Search results are then ranked using a heuristic.