Hello! I am interested in the performance. Could you provide performance numbers that also account for copying the queries to the device and copying the results back to the host?
Have you benchmarked it on other GPUs?
For several reasons, I sadly can't answer this as precisely as you might like.
For one, several kernels are involved in the graph construction, each with different register and shared-memory usage.
Additionally, this depends on the dataset the code is configured for.
Generally, datasets with larger vectors will require more registers for distance computations.
Further, the cache that holds the best list, priority queue, and visited list is configurable, and its shared-memory usage depends on its size.
For some datasets, such as SIFT1M, a small cache (in the area of 2 kB of shared memory) often suffices to achieve high recall, whereas for others, such as NYTimes (see ann-benchmarks),
we need a visited list with around 2000 entries to achieve 99% recall@1 and thus need just above 8 kB of shared memory.
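As a rough back-of-envelope illustration (the 4-byte entry size and the extra headroom for the best list and priority queue are my assumptions, not taken from the repository), the visited-list footprint alone works out like this:

```python
# Rough estimate of the visited list's shared-memory footprint.
# Assumption (not from the repository): entries are 4-byte (int32) indices.
BYTES_PER_ENTRY = 4

def visited_list_bytes(num_entries: int) -> int:
    """Shared memory consumed by a visited list with `num_entries` slots."""
    return num_entries * BYTES_PER_ENTRY

# ~2000 entries (NYTimes at 99% recall@1) -> 8000 bytes, i.e. about 8 kB;
# the best list and priority queue push the total just above that.
print(visited_list_bytes(2000))  # -> 8000
```

At 8 kB per thread block, only a handful of blocks fit per SM, which is why the cache size directly affects occupancy.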
We generally try to tune it such that we achieve high occupancy.
If you just compile the code, the PTX assembler will print the details to stderr.
We're currently working on a revision of the code and the paper and will probably update this repository once we submit the revision.
The new code should be a bit more understandable and easier to use than the current version, and it will support cosine similarity in addition to Euclidean distance.
We're also running the new tests on V100s, where we're seeing slight improvements over the Titan RTX.
Concerning copying queries to the device and copying results back to the host I can tell you the following:
There is certainly some impact on query performance if the graph has not yet been migrated to the device.
For the queries themselves, the cost should be negligible: since we run one query per thread block, each block only needs to load a single query vector.
This should not add much overhead compared to the overhead for launching the kernel.
Similarly, the results consist of just a few indices, which can be copied back quickly, but I don't have any precise numbers for you at the moment.
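To put the transfer cost in perspective, here is a hedged back-of-envelope estimate; the query dimensionality, result count k, batch size, and PCIe bandwidth below are all assumptions for illustration, not measured values from this project:

```python
# Rough estimate of host<->device transfer time for queries and results.
# Assumptions: 128-dimensional float32 queries (as in SIFT), k=10 int32
# result indices per query, ~12 GB/s effective PCIe 3.0 bandwidth.
def transfer_ms(num_queries: int, dim: int = 128, k: int = 10,
                bandwidth_gb_s: float = 12.0) -> float:
    query_bytes = num_queries * dim * 4   # float32 queries, host -> device
    result_bytes = num_queries * k * 4    # int32 indices, device -> host
    total_bytes = query_bytes + result_bytes
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3  # milliseconds

# For 10,000 queries this comes out well under a millisecond, consistent
# with the claim that the copies are small next to kernel launch overhead.
print(round(transfer_ms(10_000), 3))
```

In practice, pinned host memory and overlapping the copies with kernel execution via streams would shrink the visible cost further.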