Suggestions for Indexing Strategies? #308
Hello, first of all thank you all so much for building / maintaining this package! Second of all, I have not benchmarked anything, so please let me know if this question is impossible / impractical to ask without more concrete numbers. I am trying to understand whether, in general, it is better to have an index per customer or a single mono-index.
In previous experience with Elasticsearch I always learned that you must be careful to avoid having too many small shards. However, given my understanding of LSH, it seems that you want to avoid large shards, because extremely large shards will result in large segments, which will potentially require a very high value for candidates. In general, would you recommend optimizing for an index per customer, or for a single shared index?
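To make the question concrete, here's a rough sketch of the two layouts I'm comparing, using the Python elasticsearch client (8.x-style keyword args). The field names (my_vec, tenant_id), the dims/LSH parameters, and the elastiknn mapping body are placeholders written from memory, not a tested config:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Placeholder elastiknn mapping for a dense float vector field (parameters are
# illustrative; check the elastiknn docs for the exact mapping structure).
vec_mapping = {
    "type": "elastiknn_dense_float_vector",
    "elastiknn": {"dims": 128, "model": "lsh", "similarity": "l2", "L": 99, "k": 3},
}

# Option A: one index per customer (many small indices/shards).
for tenant in ["acme", "globex"]:
    es.indices.create(
        index=f"vectors-{tenant}",
        mappings={"properties": {"my_vec": vec_mapping}},
    )

# Option B: one shared mono-index, filtering on a tenant_id keyword at query time.
es.indices.create(
    index="vectors-all",
    mappings={
        "properties": {
            "tenant_id": {"type": "keyword"},
            "my_vec": vec_mapping,
        }
    },
)
```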
Replies: 2 comments 2 replies
Hey, that sounds like a neat use-case. Sharding is a tricky problem, especially in a multi-tenant setting. Here are some thoughts:
Anecdotally, I've had the best experience with this approach on the benchmarking I've done on ann-benchmarks datasets. No guarantee that this is representative, but I can't think of anything super special about Elastiknn that would totally invalidate the index-sizing best practices of Elasticsearch.
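To illustrate what I mean by the usual index-sizing levers, here's a sketch (the numbers are placeholders, not recommendations): shard count is fixed at index creation, and segment count per shard can be capped after bulk indexing with a force-merge.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Shard count is fixed at creation time; pick it based on expected data volume.
es.indices.create(
    index="vectors-all",
    settings={"number_of_shards": 3, "number_of_replicas": 1},
)

# ... bulk index the vectors ...

# After bulk indexing, a force-merge caps the segment count per shard, which
# also bounds the number of segments each kNN query fans out over.
es.indices.forcemerge(index="vectors-all", max_num_segments=4)
```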
I think the only case where you'd need to increase candidates is if for some reason a tenant is bound to a specific segment. For example, compare the two cases:

1. All of a tenant's vectors live in a single segment.
2. A tenant's vectors are spread across many segments.

You would probably want more candidates for (1) than for (2). You would also pay an additional penalty in (1), because each segment is queried serially; parallelism is only possible with multiple segments. If you have n threads and n segments and a healthy saturation of queries, then your query will generally be n times faster compared to having n threads and 1 segment.

It sounds like you know this, but just to be specific: the candidates are evaluated per segment. So if you set candidates=c with a total of n segments in the index, you'll end up evaluating up to n * c candidates, i.e., you'll end up computing exact KNN on n * c vectors. It might be less, but it should not be more, unless there is some bug :).

I think the only definitive mistake you could make is keeping each tenant's data in one segment. I doubt you would do this, as ES doesn't really even give you a way to do it intentionally IIRC, but it's just something to keep in mind.
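To make the per-segment arithmetic concrete, here's a sketch (the query body follows the elastiknn_nearest_neighbors DSL as I remember it, and my_vec plus the LSH parameters are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

c = 100  # candidates, applied per segment

# Illustrative query body; double-check the elastiknn docs for the exact DSL.
query = {
    "elastiknn_nearest_neighbors": {
        "field": "my_vec",
        "vec": {"values": [0.1] * 128},
        "model": "lsh",
        "similarity": "l2",
        "candidates": c,
    }
}
resp = es.search(index="vectors-all", query=query, size=10)

# With n segments in the index, exact KNN is computed on at most n * c vectors:
# e.g. 8 segments * 100 candidates = 800 vectors scored exactly.
```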
It shouldn't. AFAIK the function score just takes the first `size` docs that match the given query and runs the specified function(s) on that subset. The maximum `size` is 10k, so it will at most run exact knn on 3 * 10k vectors. There's some more discussion about this on a recent issue #298.

They could be ordered differently. If you're going to use the function score query, I would recommend first trying it with exact knn (i.e., model=exact). Then you'll get the same scores as a regular query, and it's going to hit at most 10k docs/segment, which is generally still well under a second of latency. For example, single-segment, single-threaded exact knn on the fashion-mnist dataset (60k 784-dimensional vectors) yields about 8 queries/second.
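For reference, a hedged sketch of that function score + model=exact combination (the structure is written from memory, so verify it against the elastiknn docs and the discussion in #298; tenant_id and my_vec are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# function_score: match a subset of docs (here, one tenant's docs), then score
# that subset with exact KNN so the scores match a regular exact query.
query = {
    "function_score": {
        "query": {"term": {"tenant_id": "acme"}},
        "functions": [
            {
                "elastiknn_nearest_neighbors": {
                    "field": "my_vec",
                    "vec": {"values": [0.1] * 128},
                    "model": "exact",
                    "similarity": "l2",
                }
            }
        ],
    }
}
resp = es.search(index="vectors-all", query=query, size=10)
```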