Feature request: Support for metric='precomputed' #5

Open
RichieHakim opened this issue May 7, 2023 · 3 comments

@RichieHakim

I'm currently using vanilla HDBSCAN to cluster a precomputed sparse distance matrix, passed in as a scipy.sparse.csr_matrix object. I'm very eager to use fast_hdbscan, primarily because of its easier compilation requirements: I'm trying to ship a tool that uses hdbscan as one step in a pipeline.

Currently, I believe clustering on precomputed sparse distance matrices is not supported in fast_hdbscan. I think it would require the porting of some of the following functions:

  • hdbscan_._hdbscan_sparse_distance_matrix
  • _hdbscan_reachability.sparse_mutual_reachability
  • _hdbscan_linkage.label
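To make the scope of such a port concrete, here is a minimal sketch of what a sparse mutual-reachability step could look like using only NumPy and SciPy. It is not the hdbscan implementation: the function name `sparse_mutual_reachability_sketch` and the `min_points` parameter are hypothetical, and it assumes the matrix is symmetric, that absent entries mean "no edge" (infinite distance), and that a point's core distance is its `min_points`-th smallest stored distance.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparse_mutual_reachability_sketch(dist, min_points=2):
    """Mutual reachability over the stored entries of a sparse distance matrix.

    Assumptions (not taken from hdbscan): `dist` is symmetric, absent entries
    mean "no edge", and the diagonal is not stored.
    """
    dist = csr_matrix(dist, dtype=np.float64)
    n = dist.shape[0]

    # Core distance: min_points-th smallest stored distance in each row,
    # or infinity when the row has too few stored neighbours.
    core = np.full(n, np.inf)
    for i in range(n):
        row = dist.data[dist.indptr[i]:dist.indptr[i + 1]]
        if row.size >= min_points:
            core[i] = np.partition(row, min_points - 1)[min_points - 1]

    # Mutual reachability for each stored edge (i, j):
    # max(core[i], core[j], d(i, j)).
    coo = dist.tocoo()
    mr = np.maximum(coo.data, np.maximum(core[coo.row], core[coo.col]))
    return csr_matrix((mr, (coo.row, coo.col)), shape=dist.shape)
```

The result keeps the sparsity pattern of the input, so a minimum spanning tree step would only ever see edges that were present in the original matrix.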

Unfortunately, I don't think I can figure out how to implement this myself, though I'm happy to help test any PRs with basic implementations.
Thank you for a great package. I really hope I'll be able to use it soon!

@lmcinnes
Contributor

lmcinnes commented May 7, 2023 via email

@RichieHakim
Author

RichieHakim commented May 31, 2023

Thank you so much for looking into this. I am very motivated to help if you think it's possible to delegate anything. For what it's worth, this is how hdbscan is being used in the project I'm working on: https://github.com/RichieHakim/ROICaT/blob/dev/roicat/tracking/clustering.py#L420

It may be of interest to describe the tricks/hacks I'm using to get the desired behavior. 1) I'm using a very custom sparse distance matrix as input. 2) Since the graph has multiple disconnected components, I need to add a node that is connected to every other node before clustering. 3) Since some sample pairs are known a priori to be disconnected, clusters containing these pairs ('pair violations') are split up by walking down the cutting distance until the pair violations are gone.
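Trick (2) can be sketched roughly as follows. This is an illustration rather than the code in the linked project: the function name `add_bridge_node` and the `bridge_distance` parameter are hypothetical, and the sketch assumes a symmetric sparse matrix whose absent entries mean "no edge".

```python
import numpy as np
from scipy.sparse import bmat, csr_matrix
from scipy.sparse.csgraph import connected_components


def add_bridge_node(dist, bridge_distance):
    """Append one extra node linked to every existing node.

    `bridge_distance` should be larger than any real distance so the extra
    node only matters for joining otherwise disconnected components.
    """
    n = dist.shape[0]
    # Column of n stored entries, all equal to bridge_distance.
    col = csr_matrix(np.full((n, 1), float(bridge_distance)))
    # Block layout: [[dist, col], [col.T, <empty>]].
    return bmat([[csr_matrix(dist), col], [col.T, None]], format="csr")
```

`connected_components(augmented, directed=False)` should then report a single component. After clustering, the label of the last (artificial) node would need to be dropped, for example with `labels[:-1]`.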

Playing with max_dist doesn't help much here. Single linkage is a blessing and a curse, it seems. If there were a way for the MST to be blind to any sample that would cause a violation as the tree is built up, that would be of significant utility for tracking software.
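One way to picture the "violation-blind" tree is a greedy Kruskal-style pass that refuses any merge that would put a known-disconnected pair in the same component. The sketch below is purely illustrative and is not something hdbscan or fast_hdbscan provides: `constrained_single_linkage_forest` and `cannot_link` are hypothetical names, and the result is a forest rather than a true MST.

```python
def constrained_single_linkage_forest(n, edges, cannot_link):
    """Kruskal-style merging that skips edges violating cannot-link pairs.

    n           -- number of samples
    edges       -- iterable of (weight, u, v) tuples
    cannot_link -- iterable of (a, b) pairs that must never share a component
    Returns (accepted_edges, root_label_per_sample).
    """
    parent = list(range(n))
    members = [{i} for i in range(n)]      # samples in each component
    forbidden = [set() for _ in range(n)]  # samples barred from each component
    for a, b in cannot_link:
        forbidden[a].add(b)
        forbidden[b].add(a)

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    accepted = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # already in the same component
        if forbidden[ru] & members[rv]:
            continue  # merge would create a pair violation
        if len(members[ru]) < len(members[rv]):
            ru, rv = rv, ru  # merge the smaller component into the larger
        parent[rv] = ru
        members[ru] |= members[rv]
        forbidden[ru] |= forbidden[rv]
        accepted.append((u, v, w))
    return accepted, [find(i) for i in range(n)]
```

Because merges are decided greedily in order of edge weight, the outcome depends on which edge reaches a constrained pair first; whether that behavior is acceptable for tracking is an open question.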

Thanks again, I'm a big fan of all your projects.

@RichieHakim
Author

@lmcinnes
Bumping this based on this conversation: scikit-learn-contrib/hdbscan#299.

I will look into existing semi-supervised methods for vanilla HDBSCAN, and I will look into approaches to recover / convert to embedding vectors from sparse distance matrices so that we can try fast_hdbscan. If there is a way to achieve both in one library, we are very interested. Please let me know if either would benefit from further conversation or resources. Thanks again for these amazing resources.
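As one possible starting point for the "convert to embedding vectors" idea, here is a hedged Isomap-style sketch: fill in missing distances with graph shortest paths, then apply classical MDS. The function name `embed_from_sparse_distances` is hypothetical, the approach assumes a connected, symmetric graph, and the embedding only approximates the original distances, so clustering results may differ from those on the precomputed matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path


def embed_from_sparse_distances(dist, n_components=2):
    """Approximate embedding vectors from a sparse distance matrix."""
    # Geodesic (shortest-path) distances replace the missing entries.
    geo = shortest_path(csr_matrix(dist), method="D", directed=False)
    if not np.all(np.isfinite(geo)):
        raise ValueError("graph is disconnected; geodesic distances are undefined")

    # Classical MDS: double-centre the squared distances to get a Gram matrix.
    n = geo.shape[0]
    centering = np.eye(n) - np.full((n, n), 1.0 / n)
    gram = -0.5 * centering @ (geo ** 2) @ centering

    # Keep the eigenvectors with the largest eigenvalues, scaled by sqrt.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale
```

The dense n-by-n geodesic matrix limits this to moderately sized datasets, and the disconnected-components case raised earlier in the thread would have to be handled before calling it.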
