Comparison with pgvecto.rs #77
Thank you for pointing out the issue. Our README doesn't provide detailed explanations, and we will work on improving it. Regarding IVF and HNSW, I believe they should also be included in the README. What do you think, @VoVAllen?
We open sourced it in a hurry and will write more blog posts about implementation details and performance comparisons in the coming weeks. Why IVF can be faster than HNSW is a bit of a long story, but I'll cover the essentials here.

Most of the time spent by vector search algorithms goes into distance computation, so to improve speed we need to minimize distance comparisons as much as possible. The original IVF performs poorly in this regard, typically requiring a scan of 1% to 5% of the total vectors, which is significantly more than HNSW. Quantization introduces a new approach: it lets you compress each 32-bit float dimension down to just 1 bit (as in RaBitQ) or even less (as with PQ in Faiss). While this compression loses some precision, it greatly reduces the computational cost. With the fast-scan optimization, we can evaluate these compressed distances over 32 times faster than traditional vector distance calculations, and then rerank additional candidates to recover a good recall rate. Full-precision distance computation is only needed during the reranking phase. This is why IVF can be faster than HNSW.

The same technique can also be applied to HNSW. We haven't tested it, but I believe it won't be significantly faster than using IVF: fast scan requires packing 32 compressed vectors together, which isn't feasible with HNSW, so the main time cost will lie in the reranking process. Since the quantized representations in IVF and HNSW are similar, the rerank cost will also be comparable, and the final performance difference is likely to be minimal.

As an index, IVF is also much simpler and more efficient than HNSW, particularly for insertion and deletion. In HNSW, these operations require extensive computation and cause write amplification due to cascading modifications throughout the graph. In contrast, IVF only involves inserting into or deleting from a posting list.

Here are some early benchmark results on the GIST 960-dim dataset with 1M vectors (the red line is pgvector HNSW and the blue one is VectorChord); we'll post more about it later. Elasticsearch has written two good blog posts about how RaBitQ works: link link
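For a concrete picture of the quantize-then-rerank pattern at the SQL level, pgvector 0.7+ exposes a binary_quantize function and a Hamming-distance operator that can be combined in the same two-stage way: scan cheap 1-bit codes first, over-fetch candidates, then rerank with full precision. This is not VectorChord's internal implementation (vchordrq does the candidate generation and rerank inside the index, with the 32-way fast-scan packing handled for you); the table and dimensions below are made up for illustration.

```sql
-- Two-stage search: cheap 1-bit candidate scan, then full-precision rerank.
-- Hypothetical table: items(id bigint, embedding vector(960)); $1 is the query vector.
SELECT id
FROM (
    SELECT id, embedding
    FROM items
    ORDER BY binary_quantize(embedding)::bit(960) <~> binary_quantize($1)  -- Hamming distance on 1-bit codes
    LIMIT 800                                                              -- over-fetch candidates
) AS candidates
ORDER BY embedding <-> $1                                                  -- exact L2 distance only on the candidates
LIMIT 10;
```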
We plan to suggest that users migrate to vchord at the end of this year, once it becomes more mature. The vchord implementation is more compatible with pgvector, which should reduce issues with cnpg and upgrade problems. And since it's fully compatible with pgvector (and actually based on pgvector for the vector types), Immich users can more easily choose whichever vendor they like, I think.
That sounds very promising! An important aspect for Immich is the filtering capability. Is vchord the same in this regard as pgvecto.rs?
@mertalev Yes. VBASE also works on VectorChord out of the box.
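The practical upshot of VBASE is that an ordinary filtered vector query keeps working well, because the index scan can keep streaming candidates until the filter has accepted enough rows. A minimal sketch, with hypothetical table and column names:

```sql
-- Filtered ANN search: the WHERE clause is applied on top of the index scan, and
-- VBASE-style iterative scanning keeps producing candidates until LIMIT is satisfied.
-- Hypothetical table: assets(id bigint, owner_id bigint, embedding vector(512)).
SELECT id
FROM assets
WHERE owner_id = $2
ORDER BY embedding <-> $1
LIMIT 10;
```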
Based on this, I have another question, more generally about IVF vs HNSW. A pain point with the current HNSW indices is that duplicate embeddings in the index make recall drop considerably. This can happen when a row is updated and a new tuple with the same embedding is inserted into the index. It can also happen when someone increases the face detection threshold, which removes face embeddings, then changes it back, adding the same face embeddings again. I've restructured the data model and try to do a diff comparison before DB updates to minimize duplicate entries, but I'm wondering if this is less of an issue with an IVF index, since updates and deletes are less invasive.
Restructuring may not help, due to MVCC in Postgres. If you store the embedding alongside other columns, then even if you don't update the embedding and only update the other columns, the same embedding will still be inserted into the index again because of MVCC. So you might need to move everything else out of the embedding table and use a join for that scenario.

As for your question, I don't think IVF will have the problem you describe. Duplicate vectors just mean that multiple identical entries are retrieved at the quantization stage, which might result in some extra reranking, but it won't have any effect on recall.
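A minimal sketch of that split, with hypothetical table names: metadata that changes often lives in one table, and the vector lives in a narrow table of its own, so routine metadata updates don't rewrite the row that carries the embedding.

```sql
-- Hypothetical schema: keep frequently-updated metadata away from the vector column.
CREATE TABLE assets (
    id         bigint PRIMARY KEY,
    owner_id   bigint,
    updated_at timestamptz        -- updated often; those MVCC row copies don't touch the vector table
);

CREATE TABLE asset_embeddings (
    asset_id  bigint PRIMARY KEY REFERENCES assets (id),
    embedding vector(512)         -- only rewritten when the embedding itself changes
);

-- Searches join the two tables back together.
SELECT a.id, a.owner_id
FROM asset_embeddings e
JOIN assets a ON a.id = e.asset_id
ORDER BY e.embedding <-> $1
LIMIT 10;
```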
And for the typical use case of Immich (fewer than 100k vectors), I think it also works with quantization only (set nlist to 1). The quantized vectors will only take about 100000/32/2 = 1562 pages ≈ 12 MB (say, for 768-dim vectors: 1 bit per dimension is 96 bytes per vector, a pack of 32 vectors is about 3 KB, and two packs fit in one 8 KB page). The memory requirements will also be lower.
Yes, I changed it so that embeddings go into their own separate table.
Thanks for clarifying! That's good to hear.
Interesting idea. I think a challenge there is that we don't know how many assets a user will have beforehand. We could possibly start with this and occasionally check if the number of tuples in the index is getting too high, then reindex with different settings if that's the case?
This is a potential limitation of the vchordrq index: it currently doesn't allow us to dynamically change nlist as the user's data grows. However, for Immich's workload, I believe nlist=1 can accommodate up to 1M vectors. It may be slightly slower, but I think latency will still be under 100 ms. We can do a simple experiment with the 1M 960-dim GIST dataset as a performance reference. What's the typical vector dimension in Immich?
The dimension size is typically 512: the default CLIP model is 512, and the facial recognition model is always 512. I believe the CLIP model with the highest dimension size we support is 1152, but most people don't change the model.
Hi! Like @mertalev, I am also working on a search engine using pgvecto.rs. My concern is not only returning search results but also search metadata (the number of results and facet information), which is basically a range search. (I have more information on my approach here.) My question is whether this new VectorChord extension (with the new IVF index) is going to improve range vector search, because in my experience this type of search is orders of magnitude (1,000x or even 10,000x) slower than top-k search (using HNSW in both cases). I have been reading about search iterators and maybe this could be a solution.
I'm sure they can go into more detail, but a few thoughts from me. Range queries are supported with this syntax:

```sql
SELECT val0
FROM t
WHERE val0 <<->> sphere('[0, 0, 0]'::vector, 1)
ORDER BY val0 <-> '[0, 0, 0]';
```

I haven't taken advantage of this yet, but it's supposed to be much more efficient than the syntax you shared. Also, normal CTEs are folded into the overall query in Postgres; if you expect the CTE to run to completion and have everything derive from that result, you should use the MATERIALIZED keyword.

Counting and grouping are also quite slow in Postgres when there are many rows, so I suggest avoiding that if possible. Maybe you can cache this in a materialized view or such? It's probably not a concern for a few thousand rows, though.
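For reference, a sketch of the MATERIALIZED keyword mentioned above (Postgres 12 and later; strictly speaking Postgres only inlines a CTE that is referenced exactly once, so a multi-branch query like yours is materialized by default, but spelling it out makes the intent explicit):

```sql
-- Compute the range search once, then let every branch read from the stored result.
WITH cte AS MATERIALIZED (
    SELECT id, src, author, doc_emb
    FROM documents
    WHERE doc_emb <<#>> sphere($1, 0.2)
)
SELECT count(*) FROM cte;   -- the pagination and facet branches can reuse cte the same way
```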
I think the problem is retrieving many results (a vector range search with a low threshold). I cannot change this because I want to offer both the total number of results and the facet information. In the end I want a fast way to find the N vectors in range, count them, and do faceted search in Postgres. Here is an example of a real range search retrieving 100,431 results (found images): source
@javiabellan Can I see your full SQL? Sometimes the query plan won't go through the index unless you rewrite the query in a different way.
Here is my query:

```sql
EXPLAIN ANALYZE
WITH cte AS
(
SELECT *
FROM documents
-- WHERE doc_emb <#> $1 < 0.2 -- BEFORE (NOT USING HNSW index)
WHERE doc_emb <<#>> sphere($1, 0.2) -- NOW (USING HNSW index, but much slower than topk search)
)
(
-- Sort results and then paginate
SELECT
'doc_result' AS Section,
json_build_object( 'id', id, 'src', src ) AS JSON_Value
FROM cte
ORDER BY doc_emb <#> $1 asc
LIMIT 5
OFFSET 0
)
UNION ALL
(
-- Count number of results
SELECT
'num_documents' AS Section,
json_build_object('count', COUNT(*)) AS JSON_Value
FROM cte
)
UNION ALL
(
-- faceted search info of author field
SELECT
'author' AS Section,
json_build_object(
'value', author,
'count', COUNT(*)
) AS JSON_Value
FROM cte
GROUP BY author
ORDER BY COUNT(*) DESC
LIMIT 20
)
-- more faceted search fields can be inserted here...
;
```
@javiabellan Can you combine these two parts together?
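A minimal sketch of one such combination, using a window function so the total count comes back alongside the paginated rows rather than from a second pass over the CTE:

```sql
-- Hypothetical combination: COUNT(*) OVER () counts every row that passes the range
-- filter, while ORDER BY / LIMIT / OFFSET still paginate the sorted results.
SELECT
    id,
    src,
    COUNT(*) OVER () AS total_results
FROM documents
WHERE doc_emb <<#>> sphere($1, 0.2)
ORDER BY doc_emb <#> $1
LIMIT 5
OFFSET 0;
```

The facet breakdown would still need its own GROUP BY pass, but at least the count no longer requires a separate scan.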
@javiabellan Or what you actually want is ...
Hi! I noticed this project and that it's positioned as a successor to pgvecto.rs. I'm interested to learn more about how it compares in terms of performance and recall.
I also see that it only supports IVF. Doesn't IVF normally offer a worse recall/speed tradeoff than HNSW?
Lastly, what does this mean for pgvecto.rs? Will it continue to be supported, or are users expected to migrate to vchord?