
Add parentheses to score expression #13

Merged 1 commit into quesurifn:v1.0.0-alpha on Oct 28, 2024

Conversation

bunny-therapist

Closes #10
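
For reference, this is the expression the PR touches. Below is a minimal sketch of the keyword-level score as given in the YAKE paper, assuming that formula is what yake-rust implements; the names keyword_score, term_scores, and keyword_tf are illustrative, not the crate's actual identifiers:

    // Keyword score from the YAKE paper (lower is better):
    //   S(kw) = prod(S(w)) / (TF(kw) * (1 + sum(S(w))))
    fn keyword_score(term_scores: &[f64], keyword_tf: f64) -> f64 {
        let prod: f64 = term_scores.iter().product();
        let sum: f64 = term_scores.iter().sum();
        // Without explicit parentheses, `prod / keyword_tf * (1.0 + sum)`
        // parses as `(prod / keyword_tf) * (1.0 + sum)`, i.e. it multiplies
        // by (1 + sum) instead of dividing by it -- a likely reading of
        // the bug this PR fixes.
        prod / (keyword_tf * (1.0 + sum))
    }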

quesurifn (Owner) commented Oct 28, 2024

@bunny-therapist Do you mind checking against the sample on their website?

http://yake.inesctec.pt/demo.html?doc=Sample1

It looks like the parentheses cause a decrease in accuracy.

quesurifn (Owner) commented Oct 28, 2024

@bunny-therapist Do you have a Discord where we can discuss this? I'm looking at this. The scores look better with this version, but the actual results seem slightly worse. There seems to be an increase in results containing adjectives, and I don't think that's what we want.

I'm thinking this could be a scenario where the fix you've posted is uncovering a deficiency somewhere else, but I'm not sure.

bunny-therapist (Author)

I am running it against their sample, but I am not getting the same results with either this PR or the pre-existing yake-rust code.

I am trying to create Python bindings for yake-rust so we can replace LIAAD/yake in our projects. For this reason, I am running tests comparing the yake-rust results to LIAAD/yake; that is how I am finding these issues.

Even with this PR, I am not getting the same results as LIAAD/yake or the homepage. However, I believe that is because there are more issues here. I think this PR fixes one issue, but there still seem to be issues related to relatedness and frequency (when I compare LIAAD/yake and yake-rust, the discrepancies appear to come from those two components).
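
For context, the per-term weight in the YAKE paper combines exactly those components. A sketch under the assumption that yake-rust follows the paper; the struct and field names below are made up for illustration and are not yake-rust's actual types:

    // Per-term weight from the YAKE paper:
    //   S(t) = (T_Rel * T_Position) / (T_Case + T_FNorm/T_Rel + T_Sentence/T_Rel)
    // `relatedness` (T_Rel) and `frequency` (T_FNorm) are the two components
    // suspected above of driving the remaining discrepancies.
    struct TermFeatures {
        case: f64,        // T_Case: casing aspect
        position: f64,    // T_Position: position of first occurrence
        frequency: f64,   // T_FNorm: normalized term frequency
        relatedness: f64, // T_Rel: relatedness to context
        sentences: f64,   // T_Sentence: spread across sentences
    }

    fn term_weight(f: &TermFeatures) -> f64 {
        (f.relatedness * f.position)
            / (f.case + f.frequency / f.relatedness + f.sentences / f.relatedness)
    }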

bunny-therapist (Author)

I am pretty sure this is also a bug: #12

But I think there may be more.

bunny-therapist (Author) commented Oct 28, 2024

Do we expect agreement with the yake homepage? If so, we should just use their scores for the tests in the future.

How exactly do you judge the accuracy? I don't know what to look for.

bunny-therapist (Author)

> @bunny-therapist Do you have a Discord where we can discuss this? I'm looking at this. The scores look better with this version, but the actual results seem slightly worse. There seems to be an increase in results containing adjectives, and I don't think that's what we want.
>
> I'm thinking this could be a scenario where the fix you've posted is uncovering a deficiency somewhere else, but I'm not sure.

No, I do not have a Discord. I have only used other people's Discord servers to discuss their projects; I am not a very experienced Discord user.

bunny-therapist (Author)

If I apply the changes from #12 together with these changes, we get:

        let results: Results = vec![
            ResultItem { raw: "data science".to_owned(), keyword: "data science".to_owned(), score: 0.0599 },
            ResultItem { raw: "Google Cloud Platform".to_owned(), keyword: "google cloud platform".to_owned(), score: 0.0656 },
            ResultItem { raw: "acquiring data science".to_owned(), keyword: "acquiring data science".to_owned(), score: 0.0735 },
            ResultItem { raw: "science community Kaggle".to_owned(), keyword: "science community kaggle".to_owned(), score: 0.0804 },
            ResultItem { raw: "acquiring Kaggle".to_owned(), keyword: "acquiring kaggle".to_owned(), score: 0.0924 },
            ResultItem { raw: "CEO Anthony Goldbloom".to_owned(), keyword: "ceo anthony goldbloom".to_owned(), score: 0.096 },
            ResultItem { raw: "Google Cloud".to_owned(), keyword: "google cloud".to_owned(), score: 0.1085 },
            ResultItem { raw: "Kaggle".to_owned(), keyword: "kaggle".to_owned(), score: 0.1178 },
            ResultItem { raw: "Google".to_owned(), keyword: "google".to_owned(), score: 0.1357 },
            ResultItem { raw: "machine learning".to_owned(), keyword: "machine learning".to_owned(), score: 0.1513 },
        ];
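
A tolerance-based comparison could pin results like the list above against LIAAD/yake's output. A minimal sketch; assert_scores_match and the tolerance parameter are hypothetical helpers, and some tolerance is needed since the reference scores are rounded:

    // Check a ranked (keyword, score) list against reference values,
    // e.g. scores copied from LIAAD/yake or the demo page.
    fn assert_scores_match(actual: &[(String, f64)], expected: &[(&str, f64)], tol: f64) {
        assert_eq!(actual.len(), expected.len(), "result counts differ");
        for ((kw, score), (exp_kw, exp_score)) in actual.iter().zip(expected) {
            assert_eq!(kw.as_str(), *exp_kw, "keyword order differs");
            assert!(
                (score - exp_score).abs() <= tol,
                "{kw}: got {score}, expected {exp_score}"
            );
        }
    }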

quesurifn (Owner) commented Oct 28, 2024

@bunny-therapist I can't, exactly. I'm going based on what I'd expect, so it's not objective by any measure. For me, though, the "science community kaggle" and "acquiring kaggle" results shouldn't be in that top-10 list. Everything else looks okay. The scoring still seems generally off, though it's better than before: we're now in the hundredths place, like the top results on their website, instead of the tenths.

When I originally released this, I remember noticing the issue but thinking the results were close enough.

To answer your question: yes, I think we should aim for their scores.

quesurifn (Owner) commented Oct 28, 2024

I think the play here is to branch off of main, maybe create a "v1.0.0" branch, and work towards 1:1 scores there. That way, if we start going in the wrong direction, we can easily just keep what's in main, because I think it works well enough even if we can't get to 1:1 scoring.

Let me know your thoughts. And yes, sorry for calling you out like that before. Coffee is kicking in now. I promise to be pleasant.

bunny-therapist (Author)

Working toward their scores sounds like a good idea. But we are merging this and looking into the other bug, right? Or do you mean to do that as part of that branch?

quesurifn (Owner) commented Oct 28, 2024

@bunny-therapist I'll create a v1.0.0-alpha branch and release this as that.

quesurifn changed the base branch from master to v1.0.0-alpha on October 28, 2024 at 18:49
bunny-therapist (Author)

Tomorrow I can post more about the other bugs I reported and the scores we get. I have to stop for today since it is getting late here.

quesurifn merged commit b400f1f into quesurifn:v1.0.0-alpha on Oct 28, 2024
quesurifn (Owner)

Thanks for your help. I'll release this today sometime.

Merging this pull request may close the issue: Likely bug in score normalization