Suggestions for Indexing Strategies? #308
Hello, first of all thank you all so much for building / maintaining this package! Second of all, I have not benchmarked anything, so please let me know if this question is impossible / impractical to ask without more concrete numbers. I am trying to understand whether, in general, it is better to have an index per customer or a single mono-index.
In previous experience with Elasticsearch I always learned that you must be careful to avoid having too many small shards. However, given my understanding of LSH, it seems that you want to avoid large shards, because extremely large shards will result in large segments, which will potentially require a very high value for candidates. In general, would you recommend optimizing for an index per customer, or for a single shared index?
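To make the question concrete, here's a rough sketch of the two layouts I'm comparing, using the Python elasticsearch client (8.x-style keyword args). The field names (my_vec, tenant_id), the dims/LSH parameters, and the elastiknn mapping body are placeholders written from memory, not a tested config:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Placeholder elastiknn mapping for a dense float vector field (parameters are
# illustrative; check the elastiknn docs for the exact mapping structure).
vec_mapping = {
    "type": "elastiknn_dense_float_vector",
    "elastiknn": {"dims": 128, "model": "lsh", "similarity": "l2", "L": 99, "k": 3},
}

# Option A: one index per customer (many small indices/shards).
for tenant in ["acme", "globex"]:
    es.indices.create(
        index=f"vectors-{tenant}",
        mappings={"properties": {"my_vec": vec_mapping}},
    )

# Option B: one shared mono-index, filtering on a tenant_id keyword at query time.
es.indices.create(
    index="vectors-all",
    mappings={
        "properties": {
            "tenant_id": {"type": "keyword"},
            "my_vec": vec_mapping,
        }
    },
)
```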
Replies: 2 comments 2 replies
Hey, that sounds like a neat use-case. Sharding is a tricky problem, especially in a multi-tenant setting. Here are some thoughts:
Anecdotally, I've had the best experience with this approach on the benchmarking I've done on ann-benchmarks datasets. No guarantee that this is representative, but I can't think of anything super special about Elastiknn that would totally invalidate the index-sizing best practices of Elasticsearch.
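To illustrate what I mean by the usual index-sizing levers, here's a sketch (the numbers are placeholders, not recommendations): shard count is fixed at index creation, and segment count per shard can be capped after bulk indexing with a force-merge.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Shard count is fixed at creation time; pick it based on expected data volume.
es.indices.create(
    index="vectors-all",
    settings={"number_of_shards": 3, "number_of_replicas": 1},
)

# ... bulk index the vectors ...

# After bulk indexing, a force-merge caps the segment count per shard, which
# also bounds the number of segments each kNN query fans out over.
es.indices.forcemerge(index="vectors-all", max_num_segments=4)
```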
I think the only case where you'd need to increase candidates is if for some reason a tenant is bound to a specific segment. For example, compare the two cases:

1. All of a tenant's vectors live in a single segment.
2. A tenant's vectors are spread across many segments.

You would probably want more candidates for (1) than for (2). You would also pay an additional penalty in (1), because each segment is queried serially; parallelism is only possible with multiple segments. If you have n threads and n segments and a healthy saturation of queries, then your query will generally be n times faster compared to having n threads and 1 segment.

It sounds like you know this, but just to be specific: the candidates are evaluated per segment. So if you set candidates=c with a total of n segments in the index, you'll end up evaluating up to n * c candidates, i.e., you'll end up computing exact KNN on n * c vectors. It might be less, but it should not be more, unless there is some bug :).

I think the only definitive mistake you could make is keeping each tenant's data in one segment. I doubt you would do this, as ES doesn't really even give you a way to do it intentionally IIRC, but it's just something to keep in mind.
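To make the per-segment arithmetic concrete, here's a sketch (the query body follows the elastiknn_nearest_neighbors DSL as I remember it, and my_vec plus the LSH parameters are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

c = 100  # candidates, applied per segment

# Illustrative query body; double-check the elastiknn docs for the exact DSL.
query = {
    "elastiknn_nearest_neighbors": {
        "field": "my_vec",
        "vec": {"values": [0.1] * 128},
        "model": "lsh",
        "similarity": "l2",
        "candidates": c,
    }
}
resp = es.search(index="vectors-all", query=query, size=10)

# With n segments in the index, exact KNN is computed on at most n * c vectors:
# e.g. 8 segments * 100 candidates = 800 vectors scored exactly.
```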
It shouldn't. AFAIK the function score just takes the first `size` docs that match the given query and runs the specified function(s) on that subset. The maximum `size` is 10k, so it will at most run exact knn on 3 * 10k vectors. There's some more discussion about this on a recent issue #298.

They could be ordered differently. If you're going to use the function score query, I would recommend first trying it with exact knn (i.e., model=exact). Then you'll get the same scores as a regular query, and it's going to hit at most 10k docs/segment, which is generally still well under a second of latency. For example, single-segment, single-threaded exact knn on the fashion-mnist dataset (60k 784-dimensional vectors) yields about 8 queries/second.
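For reference, a hedged sketch of that function score + model=exact combination (the structure is written from memory, so verify it against the elastiknn docs and the discussion in #298; tenant_id and my_vec are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# function_score: match a subset of docs (here, one tenant's docs), then score
# that subset with exact KNN so the scores match a regular exact query.
query = {
    "function_score": {
        "query": {"term": {"tenant_id": "acme"}},
        "functions": [
            {
                "elastiknn_nearest_neighbors": {
                    "field": "my_vec",
                    "vec": {"values": [0.1] * 128},
                    "model": "exact",
                    "similarity": "l2",
                }
            }
        ],
    }
}
resp = es.search(index="vectors-all", query=query, size=10)
```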