Other classification/nlp tools #88
Comments
@tra38 could you elaborate on which part(s) you're interested in? |
Sure. Since classifier-reborn already collects a bunch of data, it makes sense to publicly expose that data so that a programmer can feed it into other gems that handle different classification/NLP tasks. For example, I've been clustering articles using classifier-reborn and kmeans-clusterer with the following code snippet:

    require 'classifier-reborn'
    require 'kmeans-clusterer'

    lsi = ClassifierReborn::LSI.new
    strings = ["example string a", "example string b", "example string c"]
    strings.each do |x|
      lsi.add_item(x)
    end

    # Grab the internal hash of ClassifierReborn content nodes (keyed by item)
    string_data = lsi.instance_variable_get(:@items)

    # Convert each node's normalized LSI vector to a plain array for kmeans-clusterer
    data = strings.map do |string|
      string_data[string].lsi_norm.to_a
    end

    # Number of clusters (pick a value suited to the real data set)
    clusters = 13
    kmeans = KMeansClusterer.run(clusters, data, labels: strings, runs: 10)

Obviously, it's kind of hacky to reach into the LSI instance for each content node's lsi_norm just to do some k-means clustering, which is why I gave you a "thumbs up" for considering exposing this data more directly. (And if I'm using some aspect of classifier-reborn strangely here, then some other programmer will use bags of words and word counts strangely as well. Expose all the data, trust the programmer.) |
@tra38 I think we could expose the lsi data. It'll probably take some careful refactoring, but should be doable. |
Will multi-label classification be supported? For example, given an input, classify it into more than one category. |
@Looooong not with Bayes; that's not really how it works. |
You can get the raw score of each category against a given text in Bayes. This way you can decide to get top-K relevant categories, if that is what you are after. |
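As a rough sketch of the top-K idea in the comment above: once you have a hash of category scores, picking the best K is a one-liner. The helper name `top_k_categories` and the score values are hypothetical, made up for illustration. With classifier-reborn, such a hash would presumably come from something like `bayes.classifications(text)`, but check the gem's docs for the exact return shape.

```ruby
# Hypothetical helper (not part of classifier-reborn): return the k
# best-scoring category names from a { category => score } hash,
# highest score first.
def top_k_categories(scores, k)
  scores.max_by(k) { |_category, score| score }.map(&:first)
end

# Illustrative scores only. Assumption: a trained Bayes classifier would
# supply a hash like this, e.g. via `bayes.classifications(text)`.
example_scores = { "Sports" => -12.4, "Politics" => -9.1, "Tech" => -7.3 }

top_two = top_k_categories(example_scores, 2)
puts top_two.inspect
```

Because this only ranks existing scores, it needs no extra storage beyond what the classifier already keeps.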
Should we also consider adding ruby-fann (Fast Artificial Neural Network)? It won't be good for text data, I guess, but for numeric data it would be great. |
@ibnesayeed Yes, I am planning to compute multiple scores with Bayes, but I guess it will take up a large amount of storage space. |
@Looooong it really depends on the amount of training data. Between Bayes and LSI, Bayes would take relatively less space. If you have a huge amount of data, then here are a few things you can do:
|
Since |
I agree. However, I would note one thing that I encountered today while writing tests for stopwords. This needs to be instantiated and passed in via dependency injection during classifier initialization, so that one classifier does not step on another's state through the shared data. Currently, if two classifiers are instantiated with different configurations in a single program and one of them changes the set of stopwords, the other classifier is affected as well. To work around this in the tests, I had to store the original stopwords in an instance variable in the setup method and restore them in the teardown method; otherwise many other tests were failing. |
Yeah I noticed that. I think DI is the way to go then. |
In fact I had many other test cases in mind around stopwords that I could not put in place because they were seemingly very difficult (if not impossible). Similarly, some test cases could have been put together as part of assertions, but I had to separate them and duplicate most of the logic because of this stepping over behavior. |
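To make the dependency-injection idea from the stopwords discussion concrete, here is a minimal sketch. The class `SketchClassifier`, its methods, and the default word list are all hypothetical; this is not how classifier-reborn is implemented, only an illustration of giving each instance its own stopword set so one classifier cannot change another's behavior.

```ruby
require 'set'

# Hypothetical classifier (illustration only) that receives its stopwords
# at construction time instead of reading shared, mutable global state.
class SketchClassifier
  DEFAULT_STOPWORDS = %w[a an the and of].freeze

  attr_reader :stopwords

  def initialize(stopwords: DEFAULT_STOPWORDS)
    # Copy into a fresh Set so no two instances share mutable state.
    @stopwords = Set.new(stopwords)
  end

  # Adds a stopword for this instance only.
  def add_stopword(word)
    @stopwords << word.downcase
    self
  end

  # Lowercases the text, splits it into words, and drops stopwords.
  def tokens(text)
    text.downcase.scan(/[a-z']+/).reject { |w| @stopwords.include?(w) }
  end
end

first = SketchClassifier.new
second = SketchClassifier.new
first.add_stopword("ruby")

puts first.tokens("The Ruby gem").inspect
puts second.tokens("The Ruby gem").inspect
```

With per-instance state like this, tests would no longer need to save the stopwords in setup and restore them in teardown.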
We already do a bag of words, and word counts. Would it be useful to anyone to expose this functionality for other classification uses?
Some other things to consider: