don't train or untrain empty word hashes #132
Conversation
```diff
@@ -73,7 +73,9 @@ def train(category, text)
       @backend.update_category_training_count(category, 1)
       @backend.update_total_trainings(1)
-      Hasher.word_hash(text, @language, @enable_stemmer).each do |word, count|
+      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      return if word_hash.length == 0
```
I would make it the very first thing to happen in the method. The current change is essentially not changing anything significant: `update_category_training_count` and `update_total_trainings` are still being called, hence causing undesired stats changes. Also, if `@auto_categorize` is set to `true`, it might add a new category without any trainings in it.
Do you want to preempt `add_category`, or just `update_category_training_count` and `update_total_trainings`?
I would make it the very first thing in the method, even before calling `CategoryNamer`. That way, if the training has to be skipped, it makes no changes to the data-structure state under any circumstances.
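To illustrate the ordering being suggested, here is a minimal, self-contained sketch; `TinyTrainer` and its simplified `word_hash` are illustrative stand-ins, not the library's actual classes:

```ruby
require 'set'

class TinyTrainer
  attr_reader :total_trainings, :categories

  def initialize
    @total_trainings = 0
    @categories = Set.new
  end

  # Stand-in for Hasher.word_hash: keep only tokens longer than 2 characters.
  def word_hash(text)
    text.downcase.scan(/\w+/).select { |w| w.length > 2 }.tally
  end

  def train(category, text)
    hash = word_hash(text)
    return if hash.empty? # guard first: stats and category set stay untouched

    @categories << category # only reached when there is something to train on
    @total_trainings += 1
  end
end
```

With the guard as the first statement, training on text that hashes to nothing leaves both the training counts and the category set unchanged, which is exactly the concern about `@auto_categorize` above.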
```diff
@@ -91,7 +93,9 @@ def untrain(category, text)
       category = CategoryNamer.prepare_name(category)
       @backend.update_category_training_count(category, -1)
       @backend.update_total_trainings(-1)
-      Hasher.word_hash(text, @language, @enable_stemmer).each do |word, count|
+      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      return if word_hash.length == 0
```
Move this up in the method as well.
Additionally, the `classifications` method could guard against an empty word hash:

```ruby
def classifications(text)
  score = {}
  word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
  if word_hash.empty?
    category_keys.each do |category|
      score[category.to_s] = Float::INFINITY
    end
    return score
  end
  # ... remaining code
end
```
I had these changes in a local branch on my code base, but since you are taking care of it, I will just discard that.
So I made these changes, but it breaks this test:

```ruby
10.times do |a_number|
  assert_equal 'Digit', b.classify(a_number.to_s)
end
```

But I'm not sure it even makes sense to classify a single digit that way.
```ruby
# now add prior probability for the category
s = @backend.category_has_trainings?(category) ? @backend.category_training_count(category) : 0.1
score[category.to_s] += Math.log(s / @backend.total_trainings.to_f)
score
```
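The snippet above applies a smoothed log prior: a category with no trainings gets a pseudo-count of 0.1 instead of zero so that `Math.log` stays finite. The same arithmetic in a standalone sketch, with toy counts in place of the library's backend:

```ruby
# Toy training counts per category; 'Empty' has no trainings yet.
counts = { 'Interesting' => 6, 'Uninteresting' => 4, 'Empty' => 0 }
total = counts.values.sum.to_f

# Smoothed log prior: substitute 0.1 for a zero count so the log is defined.
log_priors = counts.transform_values do |c|
  s = c.positive? ? c : 0.1
  Math.log(s / total)
end
```

An untrained category thus ends up with a finite but strongly negative prior rather than `-Infinity`, so it can never win a classification outright but does not poison the score arithmetic.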
Since you did not use the `return` keyword in the `if` section of the `word_hash.empty?` clause and wrapped the remaining code in the `else` clause, I would encourage removing the implicit return of `score` from both clauses and returning it outside the conditional as the last line. Alternatively, an explicit return in the special case, without wrapping the rest of the code in an `else` clause, would reduce a level of nesting while achieving the same effect.
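The two shapes being contrasted can be sketched with a toy method (the names and scoring here are illustrative, not the library's code):

```ruby
CATEGORIES = %w[Interesting Uninteresting].freeze

# Nested style: implicit returns inside both branches of the conditional.
def scores_nested(tokens)
  if tokens.empty?
    CATEGORIES.to_h { |c| [c, Float::INFINITY] }
  else
    CATEGORIES.to_h { |c| [c, tokens.length.to_f] }
  end
end

# Guard-clause style: explicit early return, rest of the body un-nested.
def scores_guarded(tokens)
  return CATEGORIES.to_h { |c| [c, Float::INFINITY] } if tokens.empty?

  CATEGORIES.to_h { |c| [c, tokens.length.to_f] }
end
```

Both produce identical results; the guard-clause form simply drops one level of nesting for the common path.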
This is happening because of how the tokenization is currently implemented. If a token is not longer than 2 letters, it is skipped and not added to the word hash. In the new implementation of the hasher:

```ruby
next unless word.length > 2 && !STOPWORDS[language].include?(word)
```

In conclusion, we need to change the test so that it trains on numbers above 100; that should make the test pass. Then we need to think about whether we want the length limit to apply when a token is nothing but a number.
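A minimal, self-contained sketch of that filter shows why single digits never reach the classifier (the stopword set and tokenizer here are simplified stand-ins for the library's `STOPWORDS` table and `Hasher`):

```ruby
require 'set'

# Hypothetical stand-in for the library's per-language stopword table.
STOPWORDS = %w[the and for].to_set

def word_hash(text)
  hash = Hash.new(0)
  text.downcase.scan(/\w+/).each do |word|
    # Tokens of 2 characters or fewer (including single digits) are skipped,
    # as are stopwords; everything else is counted.
    next unless word.length > 2 && !STOPWORDS.include?(word)
    hash[word] += 1
  end
  hash
end
```

Training on `"7"` therefore yields an empty hash, while `"100"` survives the length check, which is why moving the training data above 100 sidesteps the failure.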
I changed the test to use words, which I think better reflects the intent, and I tweaked the threshold. But in general I agree that it was a bad test; we seem to have a lot of those. Yeah, let's prioritize #131, it will be pretty useful going forward.
@ibnesayeed any insight into why this is failing? The results of each run seem to differ.
You might want to make it something like…
@ibnesayeed thoughts on the failing test?
I am trying to look into it. If it were failing on all Rubies, I would have thought there might be some edge condition in how numbers are stored in Redis.
It might be an issue in the Redis Ruby gem.
I think the issue might be at another level, because even the expected value changes on each run. That expected value is calculated by generating a hash digest of the concatenated array of results and scores. If the array order is somehow changing, or the digest calculation is not a reliable equality test, then we might need to make some changes there.
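That failure mode is easy to reproduce in isolation: digesting a concatenation is order-sensitive, so any instability in result ordering changes the digest. A small sketch (toy results, not the actual test data):

```ruby
require 'digest'

# Two tie-scored results; if their order flips between runs, the digest flips.
results = [['Interesting', -10.0], ['Uninteresting', -10.0]]

unstable_a = Digest::SHA1.hexdigest(results.join)
unstable_b = Digest::SHA1.hexdigest(results.reverse.join)

# Canonicalising the order before digesting makes the check order-independent.
stable_a = Digest::SHA1.hexdigest(results.sort.join)
stable_b = Digest::SHA1.hexdigest(results.reverse.sort.join)
```

Sorting into a canonical order before digesting is one way to make such an equality check deterministic without touching the library code.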
I think I got the culprit, just need to dig deeper and see how it goes.
@ibnesayeed good catch
The issue is related to how `classify_with_score` works:

```ruby
def classify_with_score(text)
  (classifications(text).sort_by { |a| -a[1] })[0]
end
```

This returns the first value with the highest score. However, if more than one category has the same score, how do we resolve the conflict? In this particular test case, there is a record on line 4499 of the dataset that consists of nothing but a stopword.
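Since Ruby's `sort_by` is not guaranteed to be stable, picking element `[0]` after sorting by score alone can return either of two tied categories. One illustrative way to make the pick deterministic (a tweak for discussion, not the library's code) is to use the category name as a secondary key:

```ruby
# Scores where two categories tie exactly.
scores = { 'Interesting' => -10.0, 'Uninteresting' => -10.0, 'Spam' => -20.0 }

# Break ties alphabetically by category name: highest score first,
# then lexicographically smallest name.
def classify_with_score(scores)
  scores.min_by { |name, score| [-score, name] }
end
```

With the secondary key in place, tied inputs always resolve to the same category, so a digest over the results becomes reproducible.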
If the test uses a threshold, the situation will be a little different, as discussed in #123.
However, we should have a fix for this situation without making any changes in the library code, only in the test. I will let you know those changes in a bit.
awesome, thanks!
Please cherry-pick ibnesayeed@ed116ea
I was in a rush and could not make a proper PR in time. PR #133 should take care of the error.
It was interesting to find out that the record that was nothing but a stopword was placed at line 4499. We are using records 1-4000 for training and 4001-5000 for testing, and it happened to be the second to last record in the test set. This was coincidental, but a good coincidence that made the test more reliable.
yeah, that was a good catch
LGTM
No description provided.