clustering ideas #364

priamai · 2021-07-15T17:00:09Z

priamai
Jul 15, 2021

Hi there, I wonder whether it would be possible to write an application to cluster documents as described in this old article.

It describes how k-means, k-medoids and k-means fast can be implemented via cosine similarity, the Jaccard coefficient and the correlation coefficient. I believe it would be trivial to also add Euclidean similarity?
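For reference, here is a minimal sketch of the measures mentioned above, in plain Python. The function names are illustrative only. Turning Euclidean distance into a similarity via `1 / (1 + d)` is one common convention assumed here, not something taken from the article, and the sketch assumes non-zero vectors and non-empty sets.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Size of the intersection over size of the union of two term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def correlation_coefficient(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


def euclidean_similarity(a, b):
    """Assumed convention: map Euclidean distance d to 1 / (1 + d)."""
    return 1.0 / (1.0 + math.dist(a, b))
```

Any of these could be plugged into a k-means or k-medoids loop as the comparison function, with the usual caveat that k-means centroids are only well defined for vector-based measures, not for set-based ones like Jaccard.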

Answered by alexklibisz


alexklibisz · 2021-07-16T01:41:30Z

alexklibisz
Jul 16, 2021
Maintainer

You can't really do clustering with only Elasticsearch/Elastiknn; you need some other system as well.

Elasticsearch (and therefore also Elastiknn) is designed around a request-response model, dealing with a single document to index or a single query to execute. So you make a request and get the answer out of the cluster and back to the user as fast as possible. This means there's really no place to execute long-running iterative clustering techniques like K-means.

What you could do is this: run the clustering algorithm in some other system and use Elasticsearch/Elastiknn to store the document vectors and retrieve the clusters. So you basically start by running K-means on some representative subset of the document vectors, say in Python using scikit-learn. This gives you a smaller set of centroid vectors, one per cluster. You index all of the document vectors in Elasticsearch/Elastiknn and then use those centroid vectors to retrieve clusters of the document vectors (e.g., the 100 nearest neighbors of a centroid vector form a cluster). You could also index the centroid vectors in another index and reference them in your queries against the document vectors.
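A rough sketch of that workflow, under loud assumptions: the hand-written `kmeans` function stands in for a library such as scikit-learn, and `nearest_neighbors` is a brute-force, in-memory stand-in for a nearest-neighbor query against Elasticsearch/Elastiknn (no real client calls are shown). The document ids and the toy 2-D data are made up for illustration.

```python
import math
import random


def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's algorithm with farthest-point initialization."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Pick the vector farthest from all centroids chosen so far.
        farthest = max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: math.dist(v, centroids[i]))
            groups[nearest].append(v)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a group ends up empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def nearest_neighbors(index, query, n):
    """Stand-in for a nearest-neighbor query: ids of the n closest vectors."""
    return sorted(index, key=lambda doc_id: math.dist(index[doc_id], query))[:n]


# Toy "index": 100 documents in two well-separated groups.
rng = random.Random(0)
index = {}
for i in range(100):
    center = 0.0 if i < 50 else 10.0
    index[f"doc-{i}"] = [center + rng.uniform(-1, 1), center + rng.uniform(-1, 1)]

# 1. Cluster only a representative subset (here, every 5th vector).
subset = list(index.values())[::5]
centroids = kmeans(subset, k=2)

# 2. Use each centroid as a query; its nearest neighbors form the cluster.
clusters = [nearest_neighbors(index, c, n=50) for c in centroids]
```

In a real setup, step 2 would be a nearest-neighbor query per centroid against the index holding the document vectors, and the choice of `n` decides how large each retrieved cluster is.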

If you have more follow-up questions, let's open a discussion: https://github.com/alexklibisz/elastiknn/discussions/new


priamai · 2021-07-16T12:13:08Z

priamai
Jul 16, 2021
Author

Sure, you are spot on. I did see some ES people attempting to produce an aggregator for k-means, but they never managed to complete it; certain features needed for that are missing in the ES engine.

What I will try is the following: use Dask to load the ES documents (with the vectors) in parallel, run an efficient k-means or k-medoids implementation, and then update the documents with a cluster id field.
This way, at query time the user can retrieve the entire cluster.
I was also reading about a new database pattern called "feature databases", where a database is designed to store ML features so that ML jobs can run against it.
In this case, that means ES + EKNN (Elastiknn) fields are my feature database.
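The "update with a cluster id field" step could look roughly like the sketch below. It assumes the centroids have already been computed elsewhere, and it only builds request bodies (the newline-delimited format of the Elasticsearch bulk API and a `term` query) without sending anything. The index name `documents`, the field name `cluster_id`, the document ids and the vectors are all hypothetical; the Dask loading step is not shown.

```python
import json
import math


def assign_cluster(vector, centroids):
    """Index of the centroid closest to the vector (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))


def bulk_update_lines(docs, centroids, index_name):
    """Build bulk-API lines: one action line plus one partial-doc line per document."""
    lines = []
    for doc_id, vector in docs.items():
        cluster_id = assign_cluster(vector, centroids)
        lines.append(json.dumps({"update": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps({"doc": {"cluster_id": cluster_id}}))
    return lines


def cluster_query(cluster_id):
    """Query body that retrieves every document tagged with the given cluster."""
    return {"query": {"term": {"cluster_id": cluster_id}}}


# Hypothetical centroids and document vectors, for illustration only.
centroids = [[0.0, 0.0], [10.0, 10.0]]
docs = {"a": [0.5, -0.2], "b": [9.1, 10.4], "c": [1.0, 1.0]}

lines = bulk_update_lines(docs, centroids, "documents")
body = "\n".join(lines) + "\n"  # a bulk request body must end with a newline
```

Storing the cluster id on each document trades an extra write pass for cheap reads: retrieving a whole cluster becomes an ordinary filter query instead of a nearest-neighbor search.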

I will post some examples once it is working.
Cheers!

