Skip to content

Searching

Darren Li edited this page Nov 27, 2024 · 3 revisions

Search expression

The search field takes a search expression as input. A search expression is one of:

  • Search phrase
  • ?expression (optional)
  • (expression)
  • expression | expression (or), e.g. go ( left | right )

To use literal ?, (, ) or |, a backslash (\) needs to be placed in front of the character.

A search phrase is a sequence of tokens where a token is one of:

  • Unannotated (fuzzy match, e.g. hello means fuzzy match hello)
  • 'tok (' prefix means exact match the token)
  • ^tok (^ prefix means prefix match the token)
  • tok$ ($ suffix means suffix match the token)
  • ~ (explicit spaces, i.e. contiguous sequence of spaces, tabs, etc)

Tokens that are not separated by spaces, operators, or parentheses are treated specially, we call these linked tokens. For example, 12, :, 30 are linked in 12:30, but not in 12 : 30. Linked tokens have a much stricter search distance by default, e.g. in 12:30, Docfd will search for : only up to a few tokens away from 12, and so on. This allows user to state intention of reduced fuzziness.

To link spaces to tokens, one needs to be make use of ~. For example, to search for "John Smith" ("John" and "Smith" separated by some number of spaces), one can use John~Smith to establish linkage.

For ', ^, $ to be considered annotation markers, there cannot be space between the marker and token, e.g. ^abc means "prefix match abc", but ^ abc means "fuzzy match ^ and fuzzy match abc".

Annotated linked tokens are also treated specially:

  • ^12:30 is equivalent to '12 ': ^30
  • '12:30 is equivalent to '12 ': '30
  • 12:30$ is equivalent to 12$ ': '30

But with even stricter search restriction than the normal linked tokens, namely the next matching token must follow immediately from the current match, e.g. ^12:3 will not match 12 : 30 but will match 12:30

Search is asynchronous, specifically:

  • Editing of search field is not blocked by search progress
  • Updating/clearing the search field cancels the current search and starts a new search immediately

Optional operator handling specifics

For a phrase with optional operator, such as ?word0 word1 ..., the first word is grouped implicitly, i.e. it is treated as (?word0) word1 ....

Search phrase and search procedure

Document content and user input in the search field are tokenized/segmented in the same way, based on:

  • Contiguous alphanumeric characters
  • Individual symbols
  • Individual UTF-8 characters
  • Spaces

A search phrase is a list of said tokens.

Search procedure is a DFS through the document index, where the search range for a word is fixed to a configured range surrounding the previous word (when applicable).

A token in the index matches a token in the search phrase if they fall into one of the following cases:

  • They are a case-insensitive exact match
  • They are a case-insensitive substring match (token in search phrase being the substring)
  • They are within the configured case-insensitive edit distance threshold

Search results are then ranked using a heuristic.