Merge #619 · meilisearch/milli@2e53924

This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Commit

Merge #619

619: Refactor the Facets databases to enable incremental indexing r=curquiza a=loiclec

# Pull Request

## What does this PR do?
Party fixes #605 by making the indexing of the facet databases (i.e. `facet_id_f64_docids` and `facet_id_string_docids`) incremental. It also closes #327 and meilisearch/meilisearch#2820 . Two more untracked bugs were also fixed:
1. The facet distribution algorithm did not respect the `maxFacetValues` parameter when there were only a few candidate document ids.
2. The structure of the levels > 0 of the facet databases were not updated following the deletion of documents

## How to review this PR

First, read this comment to get an overview of the changes.

Then, based on this comment, raise any concerns you might have about:
1. the new structure of the databases
2. the algorithms for sort, facet distribution, and range search
3. the new/removed heed codecs

Then, weigh in on the following concerns:
1. adding `fuzzcheck` as a fuzz-only dependency may add too much complexity for the benefits it provides
2. the `ByteSliceRef` and `StrRefCodec` are misnamed or should not exist
3. the new behaviour of facet distributions can be considered incorrect
4. incremental deletion is useless given that documents are always deleted in bulk

## What's left for me to do

1. Re-read everything once to make sure I haven't forgotten anything
2. Wait for the results of the benchmarks and see if (1) they provide enough information (2) there was any change in performance, especially for search queries. Then, maybe, spend some time optimising the code.
3. Test whether the `info`/`http-ui` crates survived the refactor

## Old structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Previously, these two databases had different but conceptually similar structures. For each field id, the facet number database had the following format:
```
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │            1.2 – 2            │           3.4 – 100           │   102 – 104   │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   1.2 – 1.3   │    1.6 – 2    │   3.4 – 12    │  12.3 – 100   │   102 – 104   │
│Level 1│   │               │               │               │               │               │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  1.2  │  1.3  │  1.6  │   2   │  3.4  │   12  │  12.3 │  100  │  102  │  104  │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where the first line is the key of the database, consisting of :
- the field id
- the level height
- the left and right bound of the group 

and the second line is the value of the database, consisting of:
- a bitmap of all the docids that have a facet value within the bounds

The `facet_id_string_docids` had a similar structure:
```
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │             0 – 3             │             4 – 7             │     8 – 9     │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │     0 – 1     │     2 – 3     │     4 – 5     │     6 – 7     │     8 – 9     │
│Level 1│   │  "ab" – "ac"  │ "ba" – "bac"  │ "gaf" – "gal" │"form" – "wow" │ "woz" – "zz"  │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │  "AB" │ " Ac" │ "ba " │ "Bac" │ " GAF"│ "gal" │ "Form"│ " wow"│ "woz" │  "ZZ" │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where, **at level 0**, the key is:
* the normalised facet value (string)

and the value is:
* the original facet value (string)
* a bitmap of all the docids that have this normalised string facet value

**At level 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* the left bound of the range as a normalised string
* the right bound of the range as a normalised string
* a bitmap of all the docids that have a string facet value within the bounds

**At level > 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* a bitmap of all the docids that have a string facet value within the bounds

## New structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Now both the `facet_id_f64_docids` and `facet_id_string_docids` databases have the exact same structure:
```                                                                                             
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │           "ab" (2)            │           "gaf" (2)           │   "woz" (1)   │
│Level 2│   │                               │                               │               │
└───────┘   │        [a, b, d, f, z]        │        [c, d, e, f, g]        │    [u, y]     │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   "ab" (2)    │   "ba" (2)    │   "gaf" (2)   │  "form" (2)   │   "woz" (2)   │
│Level 1│   │               │               │               │               │               │
└───────┘   │ [a, b, d, z]  │   [a, b, f]   │   [c, d, g]   │    [e, f]     │    [u, y]     │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │ [a, b]│ [d, z]│ [b, f]│ [a, f]│ [c, d]│  [g]  │  [e]  │ [e, f]│  [y]  │  [u]  │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where for all levels, the key is a `FacetGroupKey<T>` containing:
* the field id (`u16`)
* the level height (`u8`)
* the left bound of the range (`T`)

and the value is a `FacetGroupValue` containing:
* the number of elements from the level below that are part of the range (`u8`, =0 for level 0)
* a bitmap of all the docids that have a facet value within the bounds (`RoaringBitmap`)

The right bound of the range is now implicit, it is equal to `Excluded(next_left_bound)`.

In the code, the key is always encoded using `FacetGroupKeyCodec<C>` where `C` is the codec used to encode the facet value (either `OrderedF64Codec` or `StrRefCodec`) and the value is encoded with `FacetGroupValueCodec`.

Since both databases share the same structure, we can implement almost all operations only once by treating the facet value as a byte slice (i.e. `FacetGroupKey<&[u8]>` encoded as `FacetGroupKeyCodec<ByteSliceRef>`). This is, in my opinion, a big simplification.

The reason for changing the structure of the databases is to make it possible to incrementally add a facet value to an existing database. Since the `facet_id_string_docids` used to store indices to `level 0` in all levels > 0, adding an element to level 0 would potentially invalidate all the indices.

Note that the original string value of a facet is no longer stored in this database.

## Incrementally adding a facet value

Here I describe how we can add a facet value to the new database incrementally. If we want to add the document with id `z` and facet value `gap`., then we want to add/modify the elements highlighted below in pink:
<img width="946" alt="Screenshot 2022-09-12 at 10 14 54" src="https://user-images.githubusercontent.com/6040237/189605532-fe4b0f52-e13d-4b3c-92d9-10c705953e3d.png">

which results in:
<img width="662" alt="Screenshot 2022-09-12 at 10 23 29" src="https://user-images.githubusercontent.com/6040237/189607015-c3a37588-b825-43c2-878a-f8f85c000b94.png">

* one element was added in level 0
* one key/value was modified in level 1
* one value was modified in level 2

Adding this element was easy since we could simply add it to level 0 and then increase the `group_size` part of the value for the level above. However, in order to keep the structure balanced, we can't always do this. If the group size reaches a threshold (`max_group_size`), then we split the node into two. For example, let's imagine that `max_group_size` is `4` and we add the docid `y` with facet value `gas`. First, we add it in level 0:
<img width="904" alt="Screenshot 2022-09-12 at 10 30 40" src="https://user-images.githubusercontent.com/6040237/189608391-531f9df1-3424-4f1f-8344-73eb194570e5.png">
Then, we realise that the group size of its parent is going to reach the maximum group size (=4) and thus we split the parent into two nodes:
<img width="919" alt="Screenshot 2022-09-12 at 10 33 16" src="https://user-images.githubusercontent.com/6040237/189608884-66f87635-1fc6-41d2-a459-87c995491ac4.png">
and since we inserted an element in level 1, we also update level 2 accordingly, by increasing the group size of the parent:
<img width="915" alt="Screenshot 2022-09-12 at 10 34 42" src="https://user-images.githubusercontent.com/6040237/189609233-d4a893ff-254a-48a7-a5ad-c0dc337f23ca.png">

We also have two other parameters:
* `group_size` is the default group size when building the database from scratch
* `min_level_size` is the minimum number of elements that a level should contain

When the highest level size is greater than `group_size * min_level_size`, then we create an additional level above it.

There is one more edge case for the insertion algorithm. While we normally don't modify the existing left bounds of a key, we have to do it if the facet value being inserted is smaller than the first left bound. For example, inserting `"aa"` with the docid `w` would change the database to:
<img width="756" alt="Screenshot 2022-09-12 at 10 41 56" src="https://user-images.githubusercontent.com/6040237/189610637-a043ef71-7159-4bf1-b4fd-9903134fc095.png">

The root of the code for incremental indexing is the `FacetUpdateIncremental` builder.

## Incrementally removing a facet value
TODO: the algorithm was implemented and works, but its current API is: `fn delete(self, facet_value, single_docid)`. It removes the given document id from all keys containing the given facet value. I don't think it is the right way to implement it anymore. Perhaps a bitmap of docids should be given instead. This is fairly easy to do. But since we batch document deletions together (because of soft deletion), it's not clear to me anymore that incremental deletion should be implemented at all.  

## Bulk insertion
While it's faster to incrementally add a single facet value to the database, it is sometimes **slower** to repeatedly add facet values one-by-one instead of doing it in bulk. For example, during initial indexing, we'd like to build the database from a list of facet values and associated document ids in one go. The `FacetUpdateBulk` builder provides a way to do so. It works by:
1. clearing all levels > 0 from the DB
2. adding all new elements in level 0
3. rebuilding the higher levels from scratch 

The algorithm for bulk insertion is the same as the previous one.

## Choosing between incremental and bulk insertion
On my computer, I measured that is about 50x slower to add N facet values incrementally than it is to re-build a database with N facet values in level 0. Therefore, we dynamically choose to use either incremental insertion or bulk insertion based on (1) the number of existing elements in level 0 of the database and (2) the number of facet values from the new documents.

This is imprecise but is mainly aimed at avoiding the worst-case scenario where the incremental insertion method is used repeatedly millions of times.

## Fuzz-testing

**Potentially controversial:**
I fuzz-tested incremental addition and deletion using fuzzcheck, which found many bugs. The fuzz-test consists of inserting/deleting facet values and docids in succession, each operation is processed with different parameters for `group_size`, `max_group_size`, and `min_level_size`. After all the operations are processed, the content of level 0 is compared to the content of an equivalent structure with a simple and easily-checked implementation. Furthermore, we check that the database has a correct structure (all groups from levels > 0 correctly combine the content of their children). I also visualised the code coverage found by the fuzz-test. It covered 100% of the relevant code except for `unreachable/panic` statements and errors returned by `heed`.

The fuzz-test and the fuzzcheck dependency are only compiled when `cargo fuzzcheck` is used. For now, the dependency is from a local path on my computer, but it can be changed to a crate version if we decide to keep it. 

## Algorithms operating on the facet databases

There are four important algorithms making use of the facet databases:
1. Sort, ascending
2. Sort, descending
3. Facet distribution
4. Range search

Previously, the implementation of all four algorithms was based on a number of iterators specific to each database kind (number or string): `FacetNumberRange`, `FacetNumberRevRange`, `FacetNumberIter` (with a reversed and reducing/non-reducing option), `FacetStringGroupRange`, `FacetStringGroupRevRange`, `FacetStringLevel0Range`, `FacetStringLevel0RevRange`, and `FacetStringIter` (reversed + reducing/non-reducing). 

Now, all four algorithms have a unique implementation shared by both the string and number databases. There are four functions:
1. `ascending_facet_sort` in `search/facet/facet_sort_ascending.rs`
2. `descending_facet_sort` in `search/facet/facet_sort_descending.rs`
3. `iterate_over_facet_distribution` in `search/facet/facet_distribution_iter.rs`
4. `find_docids_of_facet_within_bounds` in `search/facet/facet_range_search.rs`

I have tried to test them with some snapshot tests but more testing could still be done. I don't *think* that the performance of these algorithms regressed, but that will need to be confirmed by benchmarks.

## Change of behaviour for facet distributions

Previously, the original string value of a facet was stored in the level 0 of `facet_id_string_docids `. This is no longer the case. The original string value was used in the implementation of the facet distribution algorithm. Now, to recover it, we pick a random document id which contains the normalised string value and look up the original one in `field_id_docid_facet_strings`. As a consequence, it may be that the string value returned in the field distribution does not appear in any of the candidates. For example,
```json
{ "id": 0, "colour": "RED" }
{ "id": 1, "colour": "red" }
```
Facet distribution for the `colour` field among the candidates `[1]`:
```
{ "RED": 1 }
```
Here, "RED" was given as the original facet value even though it does not appear in the document id `1`.

## Heed codecs

A number of heed codecs related to the facet databases were removed:
* `FacetLevelValueF64Codec`
* `FacetLevelValueU32Codec`
* `FacetStringLevelZeroCodec`
* `StringValueCodec`
* `FacetStringZeroBoundsValueCodec`
* `FacetValueStringCodec`
* `FieldDocIdFacetStringCodec`
* `FieldDocIdFacetF64Codec`

They were replaced by:
* `FacetGroupKeyCodec<C>` (replaces all key codecs for the facet databases)
* `FacetGroupValueCodec` (replaces all value codecs for the facet databases)
* `FieldDocIdFacetCodec<C>` (replaces `FieldDocIdFacetStringCodec` and `FieldDocIdFacetF64Codec`)

Since the associated encoded item of `FacetGroupKeyCodec<C>` is `FacetKey<T>` and we often work with `FacetKey<&[u8]>` and `FacetKey<&str>`, then we need to have codecs that encode values of type `&str` and `&[u8]`. The existing `ByteSlice` and `Str` codecs do not work for that purpose (their `EItem` are `[u8]` and `str`), I have also created two new codecs:
* `ByteSliceRef` is a codec with a `EItem = DItem = &[u8]`
* `StrRefCodec` is a codec with a `EItem = DItem = &str`

I have also factored out the code used to encode an ordered f64 into its own `OrderedF64Codec`.


Co-authored-by: Loïc Lecrenier <[email protected]>

Loading branch information

bors[bot] and Loïc Lecrenier authored Oct 26, 2022

2 parents 365f44c + 54c0cf9 commit 2e53924

.gitignore

-Original file line number
+Diff line change
@@ Expand Up / @@ -2,6 +2,8 @@ @@
     /target
     /Cargo.lock
+    milli/target/
     # datasets
     *.csv
     *.mmdb
@@ Expand All / @@ -11,6 +13,8 @@ @@
     # Snapshots
     ## ... large
     *.full.snap
-    #  ... unreviewed
+    ## ... unreviewed
     *.snap.new
+    # Fuzzcheck data for the facet indexing fuzz test
+    milli/fuzz/update::facet::incremental::fuzz::fuzz/

milli/Cargo.toml

-Original file line number
+Diff line change
@@ Expand Up / @@ -54,7 +54,10 @@ big_s = "1.0.2" @@
     insta = "1.21.0"
     maplit = "1.0.2"
     md5 = "0.7.0"
-    rand = "0.8.5"
+    rand = {version = "0.8.5", features = ["small_rng"] }
+    [target.'cfg(fuzzing)'.dev-dependencies]
+    fuzzcheck = "0.12.1"
     [features]
     default = [ "charabia/default" ]
@@ Expand Down @@

milli/src/heed_codec/byte_slice_ref.rs

-Original file line number
+Diff line change
@@ -0,0 +1,23 @@
+    use std::borrow::Cow;
+    use heed::{BytesDecode, BytesEncode};
+    /// A codec for values of type `&[u8]`. Unlike `ByteSlice`, its `EItem` and `DItem` associated
+    /// types are equivalent (= `&'a [u8]`) and these values can reside within another structure.
+    pub struct ByteSliceRefCodec;
+    impl<'a> BytesEncode<'a> for ByteSliceRefCodec {
+        type EItem = &'a [u8];
+        fn bytes_encode(item: &'a Self::EItem) -> Option<Cow<'a, [u8]>> {
+            Some(Cow::Borrowed(item))
+        }
+    }
+    impl<'a> BytesDecode<'a> for ByteSliceRefCodec {
+        type DItem = &'a [u8];
+        fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
+            Some(bytes)
+        }
+    }

milli/src/heed_codec/facet/facet_level_value_f64_codec.rs

This file was deleted.

milli/src/heed_codec/facet/facet_level_value_u32_codec.rs

This file was deleted.

milli/src/heed_codec/facet/facet_string_level_zero_codec.rs

This file was deleted.

milli/src/heed_codec/facet/facet_string_level_zero_value_codec.rs

-Original file line number
+Diff line change
@@ -1,90 +0,0 @@
-    use std::borrow::Cow;
-    use std::convert::TryInto;
-    use std::{marker, str};
-    use crate::error::SerializationError;
-    use crate::heed_codec::RoaringBitmapCodec;
-    use crate::{try_split_array_at, try_split_at, Result};
-    pub type FacetStringLevelZeroValueCodec = StringValueCodec<RoaringBitmapCodec>;
-    /// A codec that encodes a string in front of a value.
-    ///
-    /// The usecase is for the facet string levels algorithm where we must know the
-    /// original string of a normalized facet value, the original values are stored
-    /// in the value to not break the lexicographical ordering of the LMDB keys.
-    pub struct StringValueCodec<C>(marker::PhantomData<C>);
-    impl<'a, C> heed::BytesDecode<'a> for StringValueCodec<C>
-    where
-        C: heed::BytesDecode<'a>,
-    {
-        type DItem = (&'a str, C::DItem);
-        fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
-            let (string, bytes) = decode_prefix_string(bytes)?;
-            C::bytes_decode(bytes).map(|item| (string, item))
-        }
-    }
-    impl<'a, C> heed::BytesEncode<'a> for StringValueCodec<C>
-    where
-        C: heed::BytesEncode<'a>,
-    {
-        type EItem = (&'a str, C::EItem);
-        fn bytes_encode((string, value): &'a Self::EItem) -> Option<Cow<[u8]>> {
-            let value_bytes = C::bytes_encode(value)?;
-            let mut bytes = Vec::with_capacity(2 + string.len() + value_bytes.len());
-            encode_prefix_string(string, &mut bytes).ok()?;
-            bytes.extend_from_slice(&value_bytes[..]);
-            Some(Cow::Owned(bytes))
-        }
-    }
-    pub fn decode_prefix_string(value: &[u8]) -> Option<(&str, &[u8])> {
-        let (original_length_bytes, bytes) = try_split_array_at(value)?;
-        let original_length = u16::from_be_bytes(original_length_bytes) as usize;
-        let (string, bytes) = try_split_at(bytes, original_length)?;
-        let string = str::from_utf8(string).ok()?;
-        Some((string, bytes))
-    }
-    pub fn encode_prefix_string(string: &str, buffer: &mut Vec<u8>) -> Result<()> {
-        let string_len: u16 =
-            string.len().try_into().map_err(|_| SerializationError::InvalidNumberSerialization)?;
-        buffer.extend_from_slice(&string_len.to_be_bytes());
-        buffer.extend_from_slice(string.as_bytes());
-        Ok(())
-    }
-    #[cfg(test)]
-    mod tests {
-        use heed::types::Unit;
-        use heed::{BytesDecode, BytesEncode};
-        use roaring::RoaringBitmap;
-        use super::*;
-        #[test]
-        fn deserialize_roaring_bitmaps() {
-            let string = "abc";
-            let docids: RoaringBitmap = (0..100).chain(3500..4398).collect();
-            let key = (string, docids.clone());
-            let bytes = StringValueCodec::<RoaringBitmapCodec>::bytes_encode(&key).unwrap();
-            let (out_string, out_docids) =
-                StringValueCodec::<RoaringBitmapCodec>::bytes_decode(&bytes).unwrap();
-            assert_eq!((out_string, out_docids), (string, docids));
-        }
-        #[test]
-        fn deserialize_unit() {
-            let string = "def";
-            let key = (string, ());
-            let bytes = StringValueCodec::<Unit>::bytes_encode(&key).unwrap();
-            let (out_string, out_unit) = StringValueCodec::<Unit>::bytes_decode(&bytes).unwrap();
-            assert_eq!((out_string, out_unit), (string, ()));
-        }
-    }

0 comments on commit `2e53924`

Please sign in to comment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit

There are no files selected for viewing

0 comments on commit `2e53924`

Commit

There are no files selected for viewing

0 comments on commit 2e53924

0 comments on commit `2e53924`