
Comparison with pgvecto.rs #77

Open
mertalev opened this issue Nov 13, 2024 · 18 comments
Labels: help wanted, question

Comments

@mertalev

mertalev commented Nov 13, 2024

Hi! I noticed this project and that it's positioned as a successor to pgvecto.rs. I'm interested to learn more about how it compares in terms of performance and recall.

I also see that it only supports IVF. Doesn't IVF normally offer a worse recall/speed tradeoff than HNSW?

Lastly, what does this mean for pgvecto.rs? Will it continue to be supported, or are users expected to migrate to vchord?

@mertalev mertalev changed the title from "Comparison with pgvector.rs" to "Comparison with pgvecto.rs" on Nov 13, 2024
@gaocegege gaocegege added the question and help wanted labels on Nov 14, 2024
@gaocegege
Member

Thank you for pointing out the issue. Our README doesn't provide detailed explanations, and we will work on improving it. Regarding IVF and HNSW, I believe they should also be included in the README. What do you think, @VoVAllen?

@VoVAllen
Member

VoVAllen commented Nov 14, 2024

We open sourced this in a hurry, and we'll publish more blog posts about implementation details and performance comparisons in the coming weeks.

Why IVF can be faster than HNSW is a bit of a long story, but I'll cover the essentials here.

Most of the time spent by vector search algorithms is on distance computation. To improve speed, we need to minimize distance comparisons as much as possible. The original IVF performs poorly in this regard, typically requiring a scan of 1% to 5% of the total vectors, which is significantly more than HNSW.

Quantization introduces a new approach: it lets you compress each 32-bit float dimension down to just 1 bit (as in RaBitQ) or even less (as with PQ in Faiss). While this compression results in some loss of precision, it greatly reduces the computational requirements. With the fast scan optimization, we can achieve computations that are over 32 times faster than traditional vector distance calculations. We then rerank additional vectors to achieve a better recall rate; full-precision distance computation is only needed during this reranking phase. This is why IVF can be faster than HNSW.
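To give a rough sense of the numbers (just back-of-the-envelope arithmetic, using the 960-dimensional GIST vectors below as an example): a full-precision vector takes 960 × 4 bytes = 3840 bytes, while a 1-bit-per-dimension code takes 960 / 8 = 120 bytes, a 32x reduction in the amount of data each distance estimate has to touch.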

The same technique can also be applied to HNSW. We haven't tested it, but I believe it won't be significantly faster than using IVF. Fastscan requires packing 32 compressed vectors together, which isn't feasible with HNSW. The main time cost will lie in the reranking process. Since the quantization representations in IVF and HNSW are similar, the rerank computation cost will also be comparable. Therefore, the final performance difference is likely to be minimal.

As an index structure, IVF is much simpler and more efficient than HNSW, particularly for insertion and deletion. In HNSW, these operations require extensive computation and cause write amplification due to the cascading modifications throughout the graph. In contrast, IVF only involves inserting into or deleting from a posting list.

Here are some early benchmark results on the GIST 960-dim dataset with 1M vectors. The red curve is pgvector HNSW and the blue one is VectorChord. We'll post more about it later.

[benchmark chart: pgvector HNSW (red) vs. VectorChord (blue) on GIST 1M]

Elasticsearch has written two good blog posts about how RaBitQ works: link, link.

@VoVAllen
Member

VoVAllen commented Nov 14, 2024

We plan to suggest users migrate to vchord at the end of this year, when it becomes more mature. The vchord implementation is more compatible with pgvector, which should reduce issues with cnpg and upgrade problems. It's fully compatible with pgvector (and actually based on pgvector for the vector types), so I think Immich users can more easily choose whichever vendor they like.

@mertalev
Author

That sounds very promising! An important aspect for Immich is the filtering capability. Is vchord the same in this regard as pgvecto.rs?

@VoVAllen
Member

@mertalev Yes. VBASE also works on vectorchord out of the box.

@mertalev
Author

mertalev commented Nov 16, 2024

As an index structure, IVF is much simpler and more efficient than HNSW, particularly for insertion and deletion. In HNSW, these operations require extensive computation and cause write amplification due to the cascading modifications throughout the graph. In contrast, IVF only involves inserting into or deleting from a posting list.

Based on this, I have another question, more generally about IVF vs HNSW:

A pain point with the current HNSW indices is that duplicate embeddings in the index make the recall drop considerably. This can happen when a row is updated and a new tuple with the same embedding is inserted into the index. It can also happen when someone increases the face detection threshold which removes face embeddings, then changes it back to add the same face embeddings again.

I've restructured the data model and now try to do a diff comparison before DB updates to minimize duplicate entries, but I'm wondering if this is less of an issue with an IVF index, since updates and deletes are less invasive.

@VoVAllen
Member

Restructuring may not help, due to MVCC in Postgres. If you store the embedding alongside other columns, then even if you never touch the embedding and only update the other columns, MVCC will still insert the same embedding into the index again. So you might need to move everything else out of the embedding table and use a join for that scenario.
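For example, something like this (the table and column names are only illustrative):

-- metadata that changes often lives in its own table...
CREATE TABLE assets (
    id         bigint PRIMARY KEY,
    owner_id   bigint,
    updated_at timestamptz
);

-- ...while the embedding table is only written when the vector itself changes
CREATE TABLE asset_embeddings (
    asset_id  bigint PRIMARY KEY REFERENCES assets (id),
    embedding vector(512)
);

-- searches join the two, so metadata updates never touch the vector index
SELECT a.id
FROM asset_embeddings e
JOIN assets a ON a.id = e.asset_id
ORDER BY e.embedding <-> $1
LIMIT 10;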

And for your question: I don't think IVF will have the problem you describe. Duplicate vectors just mean that multiple identical entries will be retrieved at the quantization stage, which might result in some extra reranking, but it won't have any effect on recall.

@VoVAllen
Member

And for the typical use case of Immich (fewer than 100k vectors), I think it also works with quantization only (set nlist to 1). The quantized vectors will only use about 100000/32/2 = 1562 pages ≈ 12 MB (say, for 768-dim vectors). The memory requirements will also be lower.
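Roughly, a single-list index would be created like this (the exact option names may differ from what I've written here, so please check the README for the current syntax; the table name is just an example):

CREATE INDEX ON asset_embeddings USING vchordrq (embedding vector_cosine_ops) WITH (options = $$
[build.internal]
lists = [1]
$$);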

@mertalev
Author

So you might need to move everything else out of the embedding table and use a join for that scenario.

Yes, I changed it so that embeddings go into their own separate table.

And for your question: I don't think IVF will have the problem you describe. Duplicate vectors just mean that multiple identical entries will be retrieved at the quantization stage, which might result in some extra reranking, but it won't have any effect on recall.

Thanks for clarifying! That's good to hear.

And for the typical use case of Immich (fewer than 100k vectors), I think it also works with quantization only (set nlist to 1). The quantized vectors will only use about 100000/32/2 = 1562 pages ≈ 12 MB (say, for 768-dim vectors). The memory requirements will also be lower.

Interesting idea. I think a challenge there is that we don't know how many assets a user will have beforehand. We could possibly start with this and occasionally check if the number of tuples in the index is getting too high, then reindex with different settings if that's the case?
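Something like the following could serve as that periodic check (the table and index names are just placeholders):

-- cheap, approximate row count from the planner's statistics
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'asset_embeddings';

-- if it has grown past the threshold, drop and recreate the index with a larger nlist
DROP INDEX CONCURRENTLY asset_embeddings_embedding_idx;
-- ...then CREATE INDEX again with the new settings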

@VoVAllen
Member

This is a potential limitation of the vchordrq index: it currently doesn't allow us to change nlist dynamically as the user's data grows. However, for Immich's workload, I believe nlist=1 can accommodate up to 1M vectors. It may be slightly slower, but I think latency will still be under 100 ms. We can do a simple experiment with the 1M 960-dim GIST data as a performance reference. What's the typical vector dimension in Immich?

@mertalev
Author

mertalev commented Nov 17, 2024

The dimension size is typically 512 - the default CLIP model is 512 and the facial recognition model is always 512. I believe the CLIP model with the highest dimension size we support is 1152, but most people don't change the model.

@javiabellan

Hi! Like @mertalev, I am also working on a search engine using pgvecto.rs.

My concern is not only providing search results, but also search information (the number of results and facet information). This is basically a range search (WHERE embedding <#> '[3,1,2]' < threshold) whose result is stored in a CTE. I then use that CTE to sort and paginate the results, UNION ALL it with the result count (count(*)), and UNION ALL it with the optional facet information. Here is a diagram:

[screenshot: diagram of the CTE-based query structure]

I have more information on my approach here.

My question is whether this new VectorChord extension (with the new IVF index) is going to improve range vector search, because in my experience this type of search is orders of magnitude (1,000x or even 10,000x) slower than top-k search (using HNSW in both cases). I have been reading about search iterators and maybe this could be a solution.

@mertalev
Author

mertalev commented Dec 16, 2024

I'm sure they can go into more detail, but here are a few thoughts from me.

Range queries are supported with this syntax:

SELECT val0
FROM t
WHERE val0 <<->> sphere('[0, 0, 0]'::vector, 1)
ORDER BY val0 <-> '[0, 0, 0]';

I haven't taken advantage of this yet, but it's supposed to be much more efficient than the syntax you shared.

Normal CTEs are also folded into the overall query in Postgres. If you expect the CTE to run to completion and have everything derive from that result, you should use the MATERIALIZED keyword for that behavior. You'll have to do testing to see which plan works better.

Counting and grouping is also quite slow in Postgres when there are many rows, so I suggest trying to avoid that if possible. Maybe you can cache this in a materialized view or such? It's probably not a concern for a few thousand rows though.
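For example, adapting your CTE (assuming your table looks roughly like documents(id, src, author, doc_emb)), forcing materialization would look like this:

WITH cte AS MATERIALIZED (
    SELECT id, src, author, doc_emb
    FROM documents
    WHERE doc_emb <<#>> sphere($1, 0.2)
)
SELECT
    'doc_result' AS Section,
    json_build_object('id', id, 'src', src) AS JSON_Value
FROM cte
ORDER BY doc_emb <#> $1
LIMIT 5;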

@javiabellan

javiabellan commented Dec 17, 2024

I think the problem is retrieving many results (vector range search with a low threshold). I cannot change this because I want to offer both the total number of results (COUNT(*)) and the faceted search information (GROUP BY + COUNT).

Postgres automatically materializes the CTE because it detects that the CTE is used in multiple places. (This can be disabled by writing NOT MATERIALIZED.)

In the end, I want a fast way to find N vectors (range search), count them, and do faceted search in Postgres.

Example of a real case of a range search retrieving 100,431 results (found images): source

[screenshot: search results page showing 100,431 results]

@VoVAllen
Member

@javiabellan Can I see your full SQL? Sometimes the query plan won't go through the index unless the query is rewritten in a different way.

@javiabellan

javiabellan commented Dec 17, 2024

Here is my query ($1 is the query vector, passed to Postgres as a binary argument):

EXPLAIN ANALYZE
WITH cte AS
(
    SELECT *
    FROM documents
    -- WHERE doc_emb <#> $1 < 0.2       -- BEFORE (NOT USING HNSW index)
    WHERE doc_emb <<#>> sphere($1, 0.2) -- NOW (USING HNSW index, but much slower than topk search)
)
(
    -- Sort results and then paginate
    SELECT
        'doc_result' AS Section,
        json_build_object( 'id', id, 'src', src ) AS JSON_Value
    FROM cte
    ORDER BY doc_emb <#> $1 asc
    LIMIT 5
    OFFSET 0
)
UNION ALL
(
    -- Count number of results
    SELECT
        'num_documents' AS Section,
        json_build_object('count', COUNT(*)) AS JSON_Value
    FROM cte
)
UNION ALL
(
    -- faceted search info of author field
    SELECT
        'author' AS Section,
        json_build_object(
            'value', author,
            'count', COUNT(*)
        ) AS JSON_Value
    FROM cte
    GROUP BY author
    ORDER BY COUNT(*) DESC
    LIMIT 20
)
-- more faceted search fields can be inserted here...
;

@VoVAllen
Member

VoVAllen commented Dec 17, 2024

@javiabellan Can you combine these two parts together? Like SELECT * FROM documents WHERE doc_emb <<#>> sphere($1, 0.2) ORDER BY doc_emb <#> $1 LIMIT 5 OFFSET 0

WITH cte AS
(
    SELECT *
    FROM documents
    -- WHERE doc_emb <#> $1 < 0.2       -- BEFORE (NOT USING HNSW index)
    WHERE doc_emb <<#>> sphere($1, 0.2) -- NOW (USING HNSW index, but much slower than topk search)
)
(
    -- Sort results and then paginate
    SELECT
        'doc_result' AS Section,
        json_build_object( 'id', id, 'src', src ) AS JSON_Value
    FROM cte
    ORDER BY doc_emb <#> $1 asc
    LIMIT 5
    OFFSET 0
)

There might be too many vectors matching `WHERE doc_emb <<#>> sphere($1, 0.2)`, and thus it will be slow.

@VoVAllen
Member

@javiabellan Or maybe what you actually want is SELECT * FROM documents WHERE doc_emb <#> $1 < 0.2 ORDER BY doc_emb <#> $1 LIMIT 5 OFFSET 0, and this will also go through the index.
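Written out, the combined form would look something like this:

-- range filter, ordering, and pagination in one statement,
-- so the planner can push the whole thing through the vector index
SELECT
    'doc_result' AS Section,
    json_build_object('id', id, 'src', src) AS JSON_Value
FROM documents
WHERE doc_emb <<#>> sphere($1, 0.2)   -- or: doc_emb <#> $1 < 0.2
ORDER BY doc_emb <#> $1
LIMIT 5
OFFSET 0;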
