Hello! I am interested in the performance. Could you provide performance numbers that also account for copying the queries to the device and copying the results back to the host?
Have you benchmarked it on other GPUs?
For several reasons, I sadly can't answer this as precisely as you might like.
For one, several kernels are involved in the graph construction, each with different register and shared-memory usage.
Additionally, this depends on the dataset the code is configured for.
Generally, datasets with larger vectors will require more registers for distance computations.
Further, the cache that holds the best list, priority queue, and visited list is configurable, and its shared-memory usage depends on its size.
For some datasets, such as SIFT1M, a small cache (in the area of 2 kB of shared memory) often suffices to achieve high recall, whereas for others, such as NYTimes (see ann-benchmarks),
we need a visited list with around 2000 entries to achieve 99% recall@1 and thus need just above 8 kB of shared memory.
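As a rough back-of-envelope illustration (the 4-byte entry size and the extra headroom for the best list and priority queue are my assumptions, not taken from the repository), the visited-list footprint alone works out like this:

```python
# Rough estimate of the visited list's shared-memory footprint.
# Assumption (not from the repository): entries are 4-byte (int32) indices.
BYTES_PER_ENTRY = 4

def visited_list_bytes(num_entries: int) -> int:
    """Shared memory consumed by a visited list with `num_entries` slots."""
    return num_entries * BYTES_PER_ENTRY

# ~2000 entries (NYTimes at 99% recall@1) -> 8000 bytes, i.e. about 8 kB;
# the best list and priority queue push the total just above that.
print(visited_list_bytes(2000))  # -> 8000
```

At 8 kB per thread block, only a handful of blocks fit per SM, which is why the cache size directly affects occupancy.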
We generally try to tune it such that we achieve high occupancy.
If you just compile the code, the PTX assembler will print the details to stderr.
We're currently working on a revision of the code and the paper and will probably update this repository once we submit the revision.
The new code should be a bit more understandable and easier to use than the current version, and it will support cosine similarity in addition to Euclidean distance.
We're also running the new tests on V100s, where we're seeing slight improvements over the Titan RTX.
Concerning copying queries to the device and copying results back to the host I can tell you the following:
There is certainly some impact on query performance if the graph has not yet been migrated to the device.
For the queries themselves, the cost should be negligible: since we run one query per thread block, each block only needs to load a single query vector.
This should not add much overhead compared to the overhead for launching the kernel.
Similarly, the results consist of just a few indices, which can be copied back quickly, but I don't have any precise numbers for you at the moment.
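To put the transfer cost in perspective, here is a hedged back-of-envelope estimate; the query dimensionality, result count k, batch size, and PCIe bandwidth below are all assumptions for illustration, not measured values from this project:

```python
# Rough estimate of host<->device transfer time for queries and results.
# Assumptions: 128-dimensional float32 queries (as in SIFT), k=10 int32
# result indices per query, ~12 GB/s effective PCIe 3.0 bandwidth.
def transfer_ms(num_queries: int, dim: int = 128, k: int = 10,
                bandwidth_gb_s: float = 12.0) -> float:
    query_bytes = num_queries * dim * 4   # float32 queries, host -> device
    result_bytes = num_queries * k * 4    # int32 indices, device -> host
    total_bytes = query_bytes + result_bytes
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3  # milliseconds

# For 10,000 queries this comes out well under a millisecond, consistent
# with the claim that the copies are small next to kernel launch overhead.
print(round(transfer_ms(10_000), 3))
```

In practice, pinned host memory and overlapping the copies with kernel execution via streams would shrink the visible cost further.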