
Composite indexes and filtered searches #121

Open · DylanGriffith opened this issue Dec 5, 2024 · 10 comments
Labels: type/discussion 🧵, type/question 🙋

@DylanGriffith

Problem

I was interested in understanding how this extension performs when combining other filters (WHERE clauses) with vector searches. This is a common challenge for vector search indexes, and there is a long discussion about it in pgvector at pgvector/pgvector#259.

I couldn't find an existing issue or mention of the topic in this repo except for this question: #77 (comment), so I thought I'd open an issue.

Basically the idea would be to support queries like:

```sql
SELECT * FROM posts WHERE user_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```

Typically in Postgres such queries are optimized with a composite index on user_id and embedding. Composite indexes are simple for a BTREE index but much more complicated for other index types.

Without a composite index, the Postgres query planner has to choose between loading all posts WHERE user_id = 123, sorting them in memory, and limiting to the top 5, or walking the entire embedding <-> '[3,1,2]' ordering and skipping rows until it finds 5 matches where user_id = 123. The choice usually depends on statistics indicating which option will be least wasteful, but in many applications both options are very wasteful.

The worst case happens when there are many users and every user has many posts. Then the WHERE user_id = 123 plan has to sort a large number of posts in memory, and the embedding <-> '[3,1,2]' plan has to skip past many irrelevant posts from other users.
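
To make the planner's dilemma concrete, here is a minimal sketch of such a setup (illustrative only: it assumes the pgvector extension with an HNSW index, and the table and column names are invented for the example):

```sql
-- Minimal sketch (assumes pgvector; names are illustrative).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE posts (
  id        bigserial PRIMARY KEY,
  user_id   bigint,
  embedding vector(3)
);

CREATE INDEX ON posts (user_id);                             -- btree for the filter
CREATE INDEX ON posts USING hnsw (embedding vector_l2_ops);  -- vector index for the ORDER BY

-- The planner has to pick one of the two indexes (or a seq scan); it cannot
-- combine the btree filter with the vector ordering in a single index scan.
EXPLAIN ANALYZE
SELECT * FROM posts
WHERE user_id = 123
ORDER BY embedding <-> '[3,1,2]'
LIMIT 5;
```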

Solution

This issue is mainly opened as a starting point for the conversation. There are no easy answers here, only tradeoffs, and I wanted to hear the maintainers' thoughts on this problem.

Related research

So far the most thorough discussion I've seen of this topic is in https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf, but to my knowledge it has not yet yielded a production-grade database or a Postgres extension.

@gaocegege (Member)

We published a blog post about this feature in the release announcement for pgvecto.rs 0.2: https://blog.pgvecto.rs/pgvectors-02-unifying-relational-queries-and-vector-search-in-postgresql.

@gaocegege (Member)

Here are some discussions: tensorchord/pgvecto.rs#188

gaocegege added the type/question 🙋 and type/discussion 🧵 labels on Dec 6, 2024
@VoVAllen (Member) commented Dec 6, 2024

Thank you for your interest! There are two ways to do filtering with vector search: pre-filtering (checking the filter condition inside the index) and post-filtering (checking the filter condition outside the index).

With post-filtering inside Postgres, the vector index acts as an iterator that supplies candidates to Postgres, and Postgres then checks whether each candidate satisfies the filter condition. The technique that turns the index into an iterator is called VBASE. VectorChord supports this out of the box.

Pre-filtering inside Postgres is closer to the composite index you mentioned. However, the composite index has to be declared at index-creation time, like CREATE INDEX test2_mm_idx ON test2 (major, minor);. This lets the vector index skip distance calculations that are not needed. For the IVF structure inside VectorChord this is quite easy to do; we tried it in early development (#20), and pgvector also experimented with it: https://github.com/pgvector/pgvector/tree/ivfflat-filtering.

FilteredDiskANN only works for a specific type of query (i.e. an OR clause over tag selections), which doesn't fit Postgres well.

@DylanGriffith (Author)

> This lets the vector index skip distance calculations that are not needed. For the IVF structure inside VectorChord this is quite easy to do; we tried it in early development (#20), and pgvector also experimented with it

@VoVAllen I think these examples still suffer from the inefficiency I described earlier, where you may have to skip many results to find LIMIT matching candidates. I can see this is definitely an optimization over the iterator approach, where you return candidates and filter them after doing the distance calculation. But the performance of this approach breaks down badly in the common scenario where there are many "nearby" documents but none (or very few) of them match the filter.

For example, in my SELECT * FROM posts WHERE user_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5 example, consider the scenario where the user has no posts at all. The scan needs to work its way through the index, visiting every single document in embedding order and skipping them one at a time, only to eventually return no results. And if the user has 5 posts but they are not the closest matches to the embedding, they may only appear after many skipped posts. All of this means Postgres needs to read a lot of data from disk.

My understanding from https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf is that FilteredDiskANN addresses this by using the filter value (e.g. user_id) as part of the distance calculation when constructing the index. All the posts belonging to a user end up near each other (in that case, as neighbours in the graph), so traversing the index quickly finds the user's posts and doesn't need to skip as many irrelevant "close embeddings" that belong to different users.

This composite index approach is easier to implement for a filter like SELECT * FROM posts WHERE user_id = 123 ORDER BY created_at with a BTREE index, because you can just index (user_id, created_at) and Postgres will go to the exact spot in the index for the user, with the results already in created_at order.

Please correct me if I'm misunderstanding something though as I've never actually worked on Postgres index internals before.

@VoVAllen (Member) commented Dec 6, 2024

> @VoVAllen I think these examples still suffer from the inefficiency I described earlier, where you may have to skip many results to find LIMIT matching candidates. I can see this is definitely an optimization over the iterator approach, where you return candidates and filter them after doing the distance calculation. But the performance of this approach breaks down badly in the common scenario where there are many "nearby" documents but none (or very few) of them match the filter.

You're right about the post-filtering approach. That's why I also mentioned the pre-filtering approach. However, pre-filtering means you need to store the information to be filtered alongside the vector, which has extra costs in both storage and computation.

There's no simple composite index across multiple fields of different index types; you can't just combine two indexes into one.

For a query like SELECT * FROM posts WHERE user_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5, Postgres decides whether to use the index on user_id or the vector index based on cost estimation. So if there are only a few posts with user_id = 123, going through the user_id index will be faster.

FilteredDiskANN is specific to a certain type of query, as I mentioned: the tag-selection scenario. You can think of it as something like creating a partial index for each user_id. It makes a query like SELECT * FROM posts WHERE user_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5 much more efficient. But if you still want to query over the whole index, like SELECT * FROM posts ORDER BY embedding <-> '[3,1,2]' LIMIT 5, it becomes very inefficient, since it needs to aggregate results across all user_ids. There's a tradeoff here: as a general vector index type, we have to balance the general cases against the specific cases.
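
For concreteness, here is what that partial-index analogy looks like in stock Postgres (a sketch assuming pgvector's HNSW; the index name is invented):

```sql
-- A partial vector index bakes the filter into the index at build time:
-- only rows with user_id = 123 are indexed, so the scan never has to skip
-- other users' posts, at the cost of needing one such index per filter value.
CREATE INDEX posts_user123_embedding_idx
  ON posts USING hnsw (embedding vector_l2_ops)
  WHERE user_id = 123;

-- The query's WHERE clause implies the index predicate, so the planner
-- is able to use the partial index:
SELECT * FROM posts
WHERE user_id = 123
ORDER BY embedding <-> '[3,1,2]'
LIMIT 5;
```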

> This composite index approach is easier to implement for a filter like SELECT * FROM posts WHERE user_id = 123 ORDER BY created_at with a BTREE index, because you can just index (user_id, created_at) and Postgres will go to the exact spot in the index for the user, with the results already in created_at order.

The implementation is actually different from what you mentioned.

> A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned. Constraints on columns to the right of these columns are checked in the index, so they save visits to the table proper, but they do not reduce the portion of the index that has to be scanned. For example, given an index on (a, b, c) and a query condition WHERE a = 5 AND b >= 42 AND c < 77, the index would have to be scanned from the first entry with a = 5 and b = 42 up through the last entry with a = 5. Index entries with c >= 77 would be skipped, but they'd still have to be scanned through. This index could in principle be used for queries that have constraints on b and/or c with no constraint on a — but the entire index would have to be scanned, so in most cases the planner would prefer a sequential table scan over using the index.

Ref: https://www.postgresql.org/docs/current/indexes-multicolumn.html

The btree index is on the leading column. Other columns are just used to check the condition.
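
For reference, the docs' (a, b, c) example is easy to try directly; a small sketch:

```sql
-- Sketch of the docs' (a, b, c) example. The a = 5 AND b >= 42 bounds decide
-- how much of the index is walked; c < 77 is only checked against each entry
-- visited along the way.
CREATE TABLE test2 (a int, b int, c int);
INSERT INTO test2
SELECT i % 10, i % 100, i % 1000 FROM generate_series(1, 100000) AS i;
CREATE INDEX test2_abc_idx ON test2 (a, b, c);
ANALYZE test2;

EXPLAIN ANALYZE
SELECT * FROM test2 WHERE a = 5 AND b >= 42 AND c < 77;
```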

@DylanGriffith (Author) commented Dec 8, 2024

> The btree index is on the leading column. Other columns are just used to check the condition.

@VoVAllen I read that a little differently. My understanding is that a multicolumn BTREE index actually orders the data by the composite of all its key columns. What you're describing sounds more like "covering indexes", which are described at https://www.postgresql.org/docs/current/indexes-index-only-scans.html#INDEXES-INDEX-ONLY-SCANS. Those are a way to include extra columns in the index without ordering by them; they allow filtering without a heap fetch, and I assume that's what you're referring to with "just used to check the condition".

> The exact rule is that equality constraints on leading columns

Since "leading columns" is plural I believe it is talking about doing equality checks on all of the columns in order. Additionally they mention the point I was trying to make about combining sorting and the index to limit the amount of the index scanned.

> the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned

Also, that page doesn't talk about ORDER BY, which is the main detail I'm trying to highlight.

I'll try to illustrate the kind of benefit we get from a multi-column index when combining filtering and ordering with an example:

Example with/without BTREE indexes

```sql
CREATE TABLE posts (
  id bigserial PRIMARY KEY,
  user_id bigint,
  created_at timestamp
);

INSERT INTO posts (
  SELECT
    generate_series(1,100000) AS id,
    floor(random()*100)+1 AS user_id,
    NOW() + (random() * interval '100 days') AS created_at
);

EXPLAIN ANALYZE SELECT id, user_id, created_at FROM posts WHERE user_id = 5 ORDER BY created_at LIMIT 5;

CREATE INDEX ON posts (user_id);

EXPLAIN ANALYZE SELECT id, user_id, created_at FROM posts WHERE user_id = 5 ORDER BY created_at LIMIT 5;

CREATE INDEX ON posts (user_id) INCLUDE (created_at);

EXPLAIN ANALYZE SELECT id, user_id, created_at FROM posts WHERE user_id = 5 ORDER BY created_at LIMIT 5;

CREATE INDEX ON posts (user_id, created_at);

EXPLAIN ANALYZE SELECT id, user_id, created_at FROM posts WHERE user_id = 5 ORDER BY created_at LIMIT 5;
```

```
CREATE TABLE
INSERT 0 100000
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1895.42..1895.43 rows=5 width=24) (actual time=3.900..3.901 rows=5 loops=1)
   ->  Sort  (cost=1895.42..1896.67 rows=500 width=24) (actual time=3.899..3.900 rows=5 loops=1)
         Sort Key: created_at
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on posts  (cost=0.00..1887.11 rows=500 width=24) (actual time=0.017..3.866 rows=1038 loops=1)
               Filter: (user_id = 5)
               Rows Removed by Filter: 98962
 Planning Time: 0.046 ms
 Execution Time: 3.907 ms
(9 rows)

CREATE INDEX
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=378.35..378.36 rows=5 width=24) (actual time=0.226..0.226 rows=5 loops=1)
   ->  Sort  (cost=378.35..379.60 rows=500 width=24) (actual time=0.225..0.225 rows=5 loops=1)
         Sort Key: created_at
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using posts_user_id_idx on posts  (cost=0.29..370.04 rows=500 width=24) (actual time=0.008..0.195 rows=1038 loops=1)
               Index Cond: (user_id = 5)
 Planning Time: 0.113 ms
 Execution Time: 0.232 ms
(8 rows)

CREATE INDEX
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=378.35..378.36 rows=5 width=24) (actual time=0.229..0.230 rows=5 loops=1)
   ->  Sort  (cost=378.35..379.60 rows=500 width=24) (actual time=0.229..0.229 rows=5 loops=1)
         Sort Key: created_at
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using posts_user_id_idx on posts  (cost=0.29..370.04 rows=500 width=24) (actual time=0.005..0.196 rows=1038 loops=1)
               Index Cond: (user_id = 5)
 Planning Time: 0.110 ms
 Execution Time: 0.236 ms
(8 rows)

CREATE INDEX
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..4.12 rows=5 width=24) (actual time=0.077..0.079 rows=5 loops=1)
   ->  Index Scan using posts_user_id_created_at_idx1 on posts  (cost=0.42..371.17 rows=500 width=24) (actual time=0.076..0.078 rows=5 loops=1)
         Index Cond: (user_id = 5)
 Planning Time: 0.097 ms
 Execution Time: 0.083 ms
(5 rows)
```

For the example query select id, user_id, created_at from posts where user_id = 5 order by created_at limit 5; we can see that:

1. Without an index, it does a full table scan and then sorts every match for the condition user_id = 5.
2. With an index only on user_id, it does an index scan to find all records matching user_id = 5. This yields 1038 rows, all of which are passed up to the Sort node and sorted before limiting to 5.
3. The covering index on user_id INCLUDE (created_at) doesn't get used at all here, so the plan is identical to the one with the plain user_id index.
4. With a multicolumn index on (user_id, created_at), a plain Index Scan returns just 5 rows with no sorting or filtering at all. It's roughly 3x faster than the user_id-only index here (0.083 ms vs 0.236 ms), and the multiplier improves as the table grows. That's not only because it avoids extra heap fetches, but because it only has to read 5 posts: the first 5 it finds are already the first 5 in created_at order.

This indexing approach doesn't map easily onto an inverted index or a graph index, but from my reading of https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf, the FilteredDiskANN approach tries to accomplish something similar by ensuring that nearby graph nodes are also "nearby" with respect to the filters you want to apply. Traversing the index (graph) then yields a set of results matching your filter in far fewer steps through the graph than a regular HNSW index.

I'm not sure how to relate that to IVF indexes, but I'm trying to highlight the major performance opportunity we have if we can get an index that is organized by the filter conditions as well as by the vector distances.

@DylanGriffith (Author) commented Dec 9, 2024

I think there are two simple approaches to extend an IVFFlat index with the ability to filter more efficiently. For the example I described:

```sql
SELECT * FROM posts WHERE user_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```

You could do:

1. Have a separate embedding index per user_id or per range of user_ids. This would likely be unsuitable for applications with many users, and more generally for filters on high-cardinality columns. It may not require any extension changes at all if you simply partition the table by user_id (see the sketch below the list).
2. Include the user_id in the distance calculation when creating the lists at index time. Some weighting would need to be applied, but you could arrange for index lists/clusters to favour a range of user_ids. This approach would probably require careful tuning, and depending on the data's cardinality it still may not be great. I'm also not sure it's any better than the first approach, and it's certainly more complicated.
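
A sketch of approach 1 using stock declarative partitioning (assumes pgvector and a fresh schema; hash partitioning stands in for "per range of user_id"):

```sql
-- Each partition gets its own vector index, and partition pruning on
-- user_id = 123 restricts the vector scan to a single partition.
CREATE TABLE posts (
  id        bigserial,
  user_id   bigint,
  embedding vector(3)
) PARTITION BY HASH (user_id);

CREATE TABLE posts_p0 PARTITION OF posts FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE posts_p1 PARTITION OF posts FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE posts_p2 PARTITION OF posts FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE posts_p3 PARTITION OF posts FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Creating the index on the parent creates it on every partition.
CREATE INDEX ON posts USING hnsw (embedding vector_l2_ops);

-- Pruned to one partition's vector index. Each partition still mixes many
-- users, though; more partitions narrow the search further.
SELECT * FROM posts
WHERE user_id = 123
ORDER BY embedding <-> '[3,1,2]'
LIMIT 5;
```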

@VoVAllen (Member) commented Dec 9, 2024

Thank you. Your example makes a lot of sense to me. However, I think it only works when the leading operator is an equality constraint. If it's user_id < 100, I think it still needs to sort over all user_ids. For a vector index, there are definitely some query types that can be further optimized, but I haven't seen how to do it in general yet. If you have a specific query type to optimize, we're happy to discuss and collaborate.
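
That's easy to check against the earlier posts example; a sketch reusing its (user_id, created_at) index:

```sql
-- Equality on the leading column: rows arrive already in created_at order,
-- so the plan needs no Sort node.
EXPLAIN ANALYZE
SELECT id, user_id, created_at FROM posts
WHERE user_id = 5 ORDER BY created_at LIMIT 5;

-- Inequality on the leading column: matching rows arrive ordered by
-- (user_id, created_at), not by created_at alone, so every match has to be
-- sorted before the LIMIT applies.
EXPLAIN ANALYZE
SELECT id, user_id, created_at FROM posts
WHERE user_id < 100 ORDER BY created_at LIMIT 5;
```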

@DylanGriffith (Author)

> I think it only works when the leading operator is an equality constraint. If it's user_id < 100, I think it still needs to sort over all user_ids

@VoVAllen yes, I think you're right about that. The use cases I'm mainly interested in are things like user_id = 123 ORDER BY embedding <-> '[3,1,2]' and user_id IN (123,456) ORDER BY embedding <-> '[3,1,2]'. An index specifically optimized to search far fewer documents before finding LIMIT documents matching the equality check would be awesome to see.

@VoVAllen (Member)

It's possible to make each posting list a btree organized by user_id to speed this up. But the code would be much more complex, and I'm not sure how much speedup we'd see; it depends a lot on how many users there are and how many posts each user has.
