don't train or untrain empty word hashes #132
Conversation
```diff
@@ -73,7 +73,9 @@ def train(category, text)
       @backend.update_category_training_count(category, 1)
       @backend.update_total_trainings(1)
-      Hasher.word_hash(text, @language, @enable_stemmer).each do |word, count|
+      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      return if word_hash.length == 0
```
I would make it the very first thing to happen in the method. The current change is essentially not changing anything significant: `update_category_training_count` and `update_total_trainings` are still being called, hence causing undesired stats changes. Also, if `@auto_categorize` is set to `true`, it might add a new category without any trainings in it.
Do you want to preempt `add_category`, or just `update_category_training_count` and `update_total_trainings`?
I would make it the very first thing in the method, even before calling `CategoryNamer`. That way, if the training has to be skipped, it makes no changes to the data-structure state under any circumstances.
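To illustrate the ordering being suggested, here is a minimal, self-contained sketch; `TinyTrainer` and its simplified `word_hash` are illustrative stand-ins, not the library's actual classes:

```ruby
require 'set'

class TinyTrainer
  attr_reader :total_trainings, :categories

  def initialize
    @total_trainings = 0
    @categories = Set.new
  end

  # Stand-in for Hasher.word_hash: keep only tokens longer than 2 characters.
  def word_hash(text)
    text.downcase.scan(/\w+/).select { |w| w.length > 2 }.tally
  end

  def train(category, text)
    hash = word_hash(text)
    return if hash.empty? # guard first: stats and category set stay untouched

    @categories << category # only reached when there is something to train on
    @total_trainings += 1
  end
end
```

With the guard as the first statement, training on text that hashes to nothing leaves both the training counts and the category set unchanged, which is exactly the concern about `@auto_categorize` above.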
```diff
@@ -91,7 +93,9 @@ def untrain(category, text)
       category = CategoryNamer.prepare_name(category)
       @backend.update_category_training_count(category, -1)
       @backend.update_total_trainings(-1)
-      Hasher.word_hash(text, @language, @enable_stemmer).each do |word, count|
+      word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
+      return if word_hash.length == 0
```
Move this up in the method as well.
Additionally, the `classifications` method could guard against an empty word hash:

```ruby
def classifications(text)
  score = {}
  word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
  if word_hash.empty?
    category_keys.each do |category|
      score[category.to_s] = Float::INFINITY
    end
    return score
  end
  # ... remaining code
end
```
I had these changes in a local branch on my code base, but since you are taking care of it, I will just discard that.
So I made these changes, but it breaks this test:

```ruby
10.times do |a_number|
  assert_equal 'Digit', b.classify(a_number.to_s)
end
```

But I'm not sure it even makes sense to classify a single digit that way.
```ruby
# now add prior probability for the category
s = @backend.category_has_trainings?(category) ? @backend.category_training_count(category) : 0.1
score[category.to_s] += Math.log(s / @backend.total_trainings.to_f)
score
```
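The snippet above applies a smoothed log prior: a category with no trainings gets a pseudo-count of 0.1 instead of zero so that `Math.log` stays finite. The same arithmetic in a standalone sketch, with toy counts in place of the library's backend:

```ruby
# Toy training counts per category; 'Empty' has no trainings yet.
counts = { 'Interesting' => 6, 'Uninteresting' => 4, 'Empty' => 0 }
total = counts.values.sum.to_f

# Smoothed log prior: substitute 0.1 for a zero count so the log is defined.
log_priors = counts.transform_values do |c|
  s = c.positive? ? c : 0.1
  Math.log(s / total)
end
```

An untrained category thus ends up with a finite but strongly negative prior rather than `-Infinity`, so it can never win a classification outright but does not poison the score arithmetic.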
Since you did not use the `return` keyword in the `if` section of the `word_hash.empty?` clause and wrapped the remaining code in the `else` clause, I would encourage removing the implicit return of `score` from both clauses and returning it outside the conditional as the last line. Alternatively, an explicit return in the special case, without wrapping the rest of the code in an `else` clause, would reduce a level of nesting while achieving the same effect.
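The two shapes being contrasted can be sketched with a toy method (the names and scoring here are illustrative, not the library's code):

```ruby
CATEGORIES = %w[Interesting Uninteresting].freeze

# Nested style: implicit returns inside both branches of the conditional.
def scores_nested(tokens)
  if tokens.empty?
    CATEGORIES.to_h { |c| [c, Float::INFINITY] }
  else
    CATEGORIES.to_h { |c| [c, tokens.length.to_f] }
  end
end

# Guard-clause style: explicit early return, rest of the body un-nested.
def scores_guarded(tokens)
  return CATEGORIES.to_h { |c| [c, Float::INFINITY] } if tokens.empty?

  CATEGORIES.to_h { |c| [c, tokens.length.to_f] }
end
```

Both produce identical results; the guard-clause form simply drops one level of nesting for the common path.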
This is happening because of how the tokenization is currently implemented. If a token is not longer than 2 letters, it is skipped and not added to the word hash. In the new implementation of the hasher:

```ruby
next unless word.length > 2 && !STOPWORDS[language].include?(word)
```

In conclusion, we need to change the test so that it trains on numbers above 100; that should make the test pass. Then we need to think about whether we want the length limit to apply when a token is nothing but a number.
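A minimal, self-contained sketch of that filter shows why single digits never reach the classifier (the stopword set and tokenizer here are simplified stand-ins for the library's `STOPWORDS` table and `Hasher`):

```ruby
require 'set'

# Hypothetical stand-in for the library's per-language stopword table.
STOPWORDS = %w[the and for].to_set

def word_hash(text)
  hash = Hash.new(0)
  text.downcase.scan(/\w+/).each do |word|
    # Tokens of 2 characters or fewer (including single digits) are skipped,
    # as are stopwords; everything else is counted.
    next unless word.length > 2 && !STOPWORDS.include?(word)
    hash[word] += 1
  end
  hash
end
```

Training on `"7"` therefore yields an empty hash, while `"100"` survives the length check, which is why moving the training data above 100 sidesteps the failure.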
I changed the test to use words, which I think better reflects the intent, and I tweaked the threshold. But in general I agree that it was a bad test; we seem to have a lot of those. Yeah, let's prioritize #131, it will be pretty useful going forward.
@ibnesayeed any insight into why this is failing? The results of each run seem to differ.
You might want to make it something like…
@ibnesayeed thoughts on the failing test?
I am trying to look into it. If it were failing on all Rubies, I would have thought there might be some edge condition in how numbers are stored in Redis.
It might be an issue in the Redis Ruby gem.
I think the issue might be at another level, because even the expected value changes on each run. That expected value is calculated by generating a hash digest of the concatenated array of results and scores. If the array order is somehow changing, or the digest calculation is not a reliable equality test, then we might need to make some changes there.
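That failure mode is easy to reproduce in isolation: digesting a concatenation is order-sensitive, so any instability in result ordering changes the digest. A small sketch (toy results, not the actual test data):

```ruby
require 'digest'

# Two tie-scored results; if their order flips between runs, the digest flips.
results = [['Interesting', -10.0], ['Uninteresting', -10.0]]

unstable_a = Digest::SHA1.hexdigest(results.join)
unstable_b = Digest::SHA1.hexdigest(results.reverse.join)

# Canonicalising the order before digesting makes the check order-independent.
stable_a = Digest::SHA1.hexdigest(results.sort.join)
stable_b = Digest::SHA1.hexdigest(results.reverse.sort.join)
```

Sorting into a canonical order before digesting is one way to make such an equality check deterministic without touching the library code.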
I think I got the culprit, just need to dig deeper and see how it goes.
@ibnesayeed good catch
The issue is related to how `classify_with_score` works:

```ruby
def classify_with_score(text)
  (classifications(text).sort_by { |a| -a[1] })[0]
end
```

This returns the first value with the highest score. However, if more than one category has the same score, how do we resolve the conflict? In this particular test case, there is a record on line 4499 of the dataset that consists of nothing but a stopword.
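Since Ruby's `sort_by` is not guaranteed to be stable, picking element `[0]` after sorting by score alone can return either of two tied categories. One illustrative way to make the pick deterministic (a tweak for discussion, not the library's code) is to use the category name as a secondary key:

```ruby
# Scores where two categories tie exactly.
scores = { 'Interesting' => -10.0, 'Uninteresting' => -10.0, 'Spam' => -20.0 }

# Break ties alphabetically by category name: highest score first,
# then lexicographically smallest name.
def classify_with_score(scores)
  scores.min_by { |name, score| [-score, name] }
end
```

With the secondary key in place, tied inputs always resolve to the same category, so a digest over the results becomes reproducible.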
If the test uses a threshold, the situation will be a little different, as discussed in #123.
However, we should have a fix for this situation without making any changes in the library code, only in the test. I will let you know those changes in a bit.
awesome, thanks!
Please cherry-pick ibnesayeed@ed116ea
I was in a rush and could not make a proper PR in time. PR #133 should take care of the error.
It was interesting to find out that the record that was nothing but a stopword was placed at line 4499. We are using records 1-4000 for training and 4001-5000 for testing, and it happened to be the second to last record in the test set. This was coincidental, but a good coincidence that made the test more reliable.
yeah, that was a good catch
LGTM
No description provided.