Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in casing #19

Open
bunny-therapist opened this issue Oct 30, 2024 · 11 comments
Open

Bug in casing #19

bunny-therapist opened this issue Oct 30, 2024 · 11 comments

Comments

@bunny-therapist
Copy link

The yake-rust crate counts tf_a and tf_u here:

            for occurrence in word {
                if occurrence.word.chars().all(|c| c.is_uppercase()) && occurrence.word.len() > 1 {
                    cand.tf_a += 1.0;
                }
                if occurrence.word.chars().nth(0).unwrap_or(' ').is_uppercase() && occurrence.shift != occurrence.shift_offset {
                    cand.tf_u += 1.0;
                }
            }

LIAAD yake has the same here:

            if len(word) == len([c for c in word if c.isupper()]):
                return "a"  # This corresponds to incrementing tf_a
            if len([c for c in word if c.isupper()]) == 1 and len(word) > 1 and word[0].isupper() and i > 0:
                return "n"  # This corresponds to incrementing tf_u

Just by looking at the case restrictions, it seems like yake-rust increments tf_uif the first character is uppercase, whereas LIAAD/yake increments if only the first character is uppercase (with an additional restriction i>0 which means "not the first word in a sentence" - the i is the index in the sentence - maybe the shift offset check is equivalent to this, I do not know).

This means that yake-rust increments for e.g. "TopCoder" whereas LIAAD/yake does not. Because LIAAD/yake sees that there are multiple uppercase letters ("T" and "C"), it skips it; yake-rust just cares that the "T" is uppercase.

This debug print shows that tf_u (called tf_n in LIAAD/yake, but aligned here) differs, thus causing casing to also differ:

// LIAAD/yake:
TopCoder: tf: 1.0, tf_a: 0.0, tf_u: 0.0, casing: 0.0, position: 0.0940478276166991, frequency: 1.0, different: 1.0, relatedness: 1.5
// yake-rust:
topcoder: tf: 1, tf_a: 0, tf_u: 1, casing: 1, position: 0.0940478276166991, frequency: 1, different: 1, relatedness: 1.5
@bunny-therapist
Copy link
Author

@xamgore

@bunny-therapist
Copy link
Author

Ok, I added one line to the tf_u condition (this can probably be written in a nicer way?):

                if occurrence.word.chars().nth(0).unwrap_or(' ').is_uppercase()

                    // New:
                    && occurrence.word.chars().filter_map(
                        |c| if c.is_uppercase() { Some(c) } else { None }
                        ).collect::<Vec<char>>().len() == 1

                    && occurrence.shift != occurrence.shift_offset
                {
                    cand.tf_u += 1.0;
                }

and that aligned the results for "TopCoder":

// LIAAD/yake:
TopCoder: tf: 1.0, tf_a: 0.0, tf_u: 0.0, casing: 0.0, position: 0.0940478276166991, frequency: 1.0, different: 1.0, relatedness: 1.5
// yake-rust:
topcoder: tf: 1, tf_a: 0, tf_u: 0, casing: 0, position: 0.0940478276166991, frequency: 1, different: 1, relatedness: 1.5

@xamgore
Copy link
Contributor

xamgore commented Oct 30, 2024

I completely agree with the reasoning. Curious, why LIAAD have decided not to mark such tokens as uppercase? The paper makes no accent on this, just “begins with an uppercase, excluding the beginning of sentences”.

Kyle's approach is still sound, though. PayPal, MasterCard are as important as Paypal or Mastercard.

@bunny-therapist
Copy link
Author

bunny-therapist commented Oct 30, 2024

I don't know exactly why. The LIAAD/yake repo does not seem "blameable" - most of the code was developed offline, then uploaded as is and not touched.

It can't be because of words like "I" because they also check if length > 1 - and that handles that.

Could it have something to do with abbreviations?

I suppose it is up to Kyle whether or not we should do our own thing here or try to align everything with LIAAD/yake as much as possible.

Personally, I would vote to align it. I am trying to replace LIAAD/yake in our projects with this, and I bet there are others who would love a drop-in replacement in rust (example). I will make python bindings so it can be used as a drop-in replacement that is much faster and concurrent than LIAAD/yake. (That is why I got into this, I was making python bindings to replace LIAAD, then I noticed I got different results.)

If there are differences, one is stuck arguing that our results make more sense. (Though I do agree with you about the logic, albeit I have no clue why LIAAD did what they did.)

@bunny-therapist
Copy link
Author

If we do it like LIAAD/yake, we should at least write a comment about it.

@xamgore
Copy link
Contributor

xamgore commented Oct 30, 2024

Ok, let it be just another field at Config 🤷🏻‍♂️.

Btw, probably Kyle could make us maintainers, so we could put python bindings right in the repo. Wasm too.

@bunny-therapist
Copy link
Author

Is there any benefit to having everything in the same repo though? I suppose automatic releases of everything together, but I worry that the project structure will get a bit complicated. Especially if one wants to add more bindings.

@bunny-therapist
Copy link
Author

Being able to release a new crate and a new python package together would be nice though. But I suppose it would still have to be a bit staggered, since one wants to publish to pypi after publishing the crate, since the python package would depend on the crate.

@xamgore
Copy link
Contributor

xamgore commented Oct 30, 2024

Can't say for sure, probably there is even a Github action exactly for such a pipeline. Do pyo3 bindings impose publishing a separate yake-rust-to-python crate?

@bunny-therapist
Copy link
Author

No. You publish a python package that has the rust crate as a dependency. The compiled binary gets built into any wheels built, but if there is no wheel matching the user's system, installing the python package will download the crate and compile it.

The python package contains python code, as well as some additional rust code that calls the crate, just with wrapping functions that can be decorated with macros to generate bindings. (I do not recommend adding these binding macros to the original rust code - you will likely end up with issues related to python garbage collection, threading, memory management etc. I intend to wrap them with additional rust code.)

@bunny-therapist
Copy link
Author

Or rather, I have already done that, provisionally. In my local project I have tests which run both yake-rust thru the bindings and LIAAD/yake, with same config, and then compares the result.

That is how I am debugging this stuff, mostly.

xamgore added a commit to xamgore/yake-rust that referenced this issue Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants