clustering ideas #364
-
Hi there, I wonder whether it would be possible to write an application to cluster documents as described in this old article. It describes how k-means, k-medoids and k-means fast can be implemented via cosine similarity, the Jaccard coefficient and the correlation coefficient. I believe it would also be trivial to add Euclidean distance?
Replies: 2 comments
-
You can't really do clustering with only Elasticsearch/Elastiknn; you need some other system as well.

Elasticsearch (and therefore also Elastiknn) is designed around a request-response model: each request either indexes a single document or executes a single query, and the goal is to get the answer out of the cluster and back to the user as fast as possible. This means there's really no place to execute long-running, iterative clustering techniques like K-means.

What you could do is this: run the clustering algorithm in some other system and use Elasticsearch/Elastiknn to store and retrieve the vectors. You start by running K-means on some representative subset of the document vectors, say in Python using scikit-learn. This gives you a smaller set of centroid vectors, one per cluster. You then index all of the document vectors in Elasticsearch/Elastiknn and use those centroid vectors to retrieve clusters of document vectors (e.g., the 100 nearest neighbors of a centroid vector form a cluster). You could also index the centroid vectors in another index and reference them in your queries against the document vectors.

If you have more follow-up questions, let's open a discussion: https://github.com/alexklibisz/elastiknn/discussions/new
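A minimal sketch of that workflow, assuming the document vectors fit in a NumPy array. The function names `fit_centroids` and `nearest_to_centroid`, the toy data, and the sample size are all made up for illustration. The brute-force search is only a stand-in for the nearest-neighbor query you would actually send to Elasticsearch/Elastiknn.

```python
import numpy as np
from sklearn.cluster import KMeans


def fit_centroids(vectors, n_clusters, sample_size, seed=0):
    """Run K-means on a random subset of the vectors and return the centroids."""
    rng = np.random.default_rng(seed)
    sample_size = min(sample_size, len(vectors))
    sample_idx = rng.choice(len(vectors), size=sample_size, replace=False)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    model.fit(vectors[sample_idx])
    return model.cluster_centers_


def nearest_to_centroid(vectors, centroid, k):
    """Brute-force stand-in for a nearest-neighbor query against the index.

    Returns the row indices of the k vectors closest to the centroid
    (Euclidean distance).
    """
    distances = np.linalg.norm(vectors - centroid, axis=1)
    return np.argsort(distances)[:k]


# Toy data: two well-separated groups of 8-dimensional "document vectors".
rng = np.random.default_rng(42)
docs = np.vstack([
    rng.normal(0.0, 0.1, size=(200, 8)),
    rng.normal(5.0, 0.1, size=(200, 8)),
])

# Step 1: K-means on a representative subset gives one centroid per cluster.
centroids = fit_centroids(docs, n_clusters=2, sample_size=100)

# Step 2: the 100 nearest neighbors of each centroid form a cluster.
clusters = [nearest_to_centroid(docs, c, k=100) for c in centroids]
```

In a real setup, step 2 would be one nearest-neighbor query per centroid against the index holding the document vectors, so only the small sample ever has to be loaded into Python.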
-
Sure, you are spot on. I did see some ES people attempting to build an aggregator for k-means, but they never managed to complete it; certain features needed for that are missing in the ES engine. What I will try is the following: use Dask to load the ES documents (with the vectors) in parallel, run an efficient k-means or k-medoids implementation, and then update the documents with a cluster id field. I will post some examples once it is working.
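A rough sketch of the last step only (writing a cluster id back to each document), assuming the centroids have already been computed elsewhere. The index name `documents`, the field name `cluster_id`, and both helper functions are hypothetical. The action dictionaries follow the shape that the `helpers.bulk` function in the official Elasticsearch Python client accepts for partial updates; actually sending them, and the Dask loading, are left out.

```python
import math


def nearest_centroid(vector, centroids):
    """Return the position of the centroid closest to the vector (Euclidean)."""
    best_id, best_dist = None, math.inf
    for cluster_id, centroid in enumerate(centroids):
        dist = math.dist(vector, centroid)
        if dist < best_dist:
            best_id, best_dist = cluster_id, dist
    return best_id


def build_update_actions(docs, centroids, index_name, field="cluster_id"):
    """Yield one partial-update action per (doc_id, vector) pair."""
    for doc_id, vector in docs:
        yield {
            "_op_type": "update",
            "_index": index_name,  # hypothetical index name supplied by caller
            "_id": doc_id,
            "doc": {field: nearest_centroid(vector, centroids)},
        }


# Toy input: (document id, vector) pairs and two precomputed centroids.
docs = [("a", [0.0, 0.1]), ("b", [4.9, 5.2]), ("c", [0.2, -0.1])]
centroids = [[0.0, 0.0], [5.0, 5.0]]

actions = list(build_update_actions(docs, centroids, index_name="documents"))
```

Because the actions come from a generator, they can be streamed to the bulk helper in batches instead of being materialized all at once for a large index.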