How much shared memory does this kernel use? #3

Open
goodluckcwl opened this issue Dec 22, 2020 · 1 comment
Comments


goodluckcwl commented Dec 22, 2020

Hello! I am interested in the performance. Could you provide performance numbers that also account for copying the queries to the device and copying the results back to the host?
Have you benchmarked it on other GPUs?

@LukasRuppert
Collaborator

Due to several reasons, I sadly can't answer this as precisely as you might like.
On the one hand, there are actually several kernels involved in the graph construction, each with different register and shared-memory usage.
Additionally, this depends on the dataset the code is configured for.
Generally, datasets with larger vectors will require more registers for distance computations.
Further, the cache which handles the best list, priority queue, and visited list is configurable, and the shared memory usage depends on its size.
For some datasets such as SIFT1M, smaller caches often suffice to achieve high recall; there, the cache is in the area of 2 kB of shared memory.
For other datasets such as NYTimes (see ann-benchmarks), we need a visited list with around 2000 entries to achieve 99% recall@1 and thus need just above 8 kB of shared memory.
We generally try to tune it such that we achieve high occupancy.
If you just compile the code with verbose PTX assembler output enabled (e.g. `nvcc -Xptxas=-v`), the assembler will print the register and shared-memory usage of each kernel to stderr.
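As a back-of-the-envelope illustration of the numbers above, here is a sketch of how cache size translates into shared memory and into an occupancy limit. All concrete values are assumptions for illustration (32-bit entries, hypothetical best-list and priority-queue sizes of 32 entries each, 64 kB of shared memory per SM); the actual cache layout in the repository may differ.

```python
BYTES_PER_ENTRY = 4  # assuming 32-bit indices per cache slot

def cache_bytes(visited, best_list, prio_queue):
    """Shared memory for one query's cache (hypothetical layout)."""
    return BYTES_PER_ENTRY * (visited + best_list + prio_queue)

def blocks_per_sm(smem_per_block, smem_per_sm=64 * 1024):
    """Occupancy limit imposed by shared memory alone."""
    return smem_per_sm // smem_per_block

small = cache_bytes(500, 32, 32)    # SIFT1M-like: ~2 kB cache
large = cache_bytes(2000, 32, 32)   # NYTimes-like: just above 8 kB
print(small, blocks_per_sm(small))  # 2256 bytes -> 29 blocks/SM
print(large, blocks_per_sm(large))  # 8256 bytes -> 7 blocks/SM
```

This also shows why larger caches hurt occupancy: with one query per thread block, shared memory alone caps how many queries an SM can process concurrently.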

We're currently working on a revision of the code and the paper and will probably update this repository once we submit the revision.
The new code should be a bit more understandable and easier to use than the current version and supports cosine similarity in addition to euclidean distances.
We're also running the new tests on V100s where we're seeing slight improvements over the Titan RTX.

Concerning copying queries to the device and copying results back to the host I can tell you the following:
It certainly has a small impact on query performance if the graph has not yet been migrated to the device.
Copying the queries should be negligible: since we run individual queries per thread block, each thread block only needs to load a single query vector.
This should not add much overhead compared to the overhead of launching the kernel.
Similarly, the results consist of just a few indices which can be copied back quickly, but I don't have any precise numbers for you at the moment.
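To put the copy overhead in perspective, here is a rough transfer-time estimate. All numbers are assumptions for illustration, not measurements from the repository: SIFT-like 128-dimensional float32 queries, 10 result indices per query, and an assumed effective PCIe bandwidth of 12 GB/s.

```python
DIM = 128          # assumed query dimensionality (SIFT-like)
FLOAT_BYTES = 4    # float32 components
K = 10             # assumed result indices per query
INDEX_BYTES = 4    # 32-bit indices
PCIE_BPS = 12e9    # assumed effective PCIe bandwidth in bytes/s

def transfer_ms(n_queries):
    """Estimated host->device + device->host copy time in ms."""
    up = n_queries * DIM * FLOAT_BYTES      # queries to device
    down = n_queries * K * INDEX_BYTES      # result indices to host
    return (up + down) / PCIE_BPS * 1e3

print(f"{transfer_ms(10_000):.3f} ms")  # ~0.460 ms for 10k queries
```

Under these assumptions, the transfers cost well under a millisecond for a 10k-query batch, which supports the point that the copies are a small fraction of overall query time.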
