Issue with punctuation and context building #14

Open
bunny-therapist opened this issue Oct 29, 2024 · 6 comments

Comments

@bunny-therapist

The created contexts contain punctuation symbols. If a word is just composed of punctuation symbols, it should be skipped and the buffer emptied.

I fixed this in my branch here: bunny-therapist@a9c3a99

The relevant part in LIAAD/yake is here: https://github.com/LIAAD/yake/blob/master/yake/datarepresentation.py#L59
The "exclude" chars in LIAAD/yake are what is called "punctuation" in yake-rust.

@bunny-therapist
Author

@xamgore

@bunny-therapist
Author

@xamgore - just making sure you saw this. Maybe you found a simpler way to solve this?

@xamgore
Contributor

xamgore commented Oct 29, 2024

The Python code is a mess, there's really no way for me to comprehend it right now 😄 Maybe on the weekend.

I've seen that candidate_filtering also throws punctuation words out.

@bunny-therapist
Author

Yeah, but the vocabulary is apparently different from the candidates. The "buffer/buffer_words" vec we have ends up with elements like "!" and "?", which then affect ctx.0/1 and thus wr and wl, and thus frequency and relatedness.
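To make the effect concrete, here is a minimal sketch of window-based context building where a punctuation-only token is skipped and the buffer is emptied, as proposed in the opening comment. It is not the yake-rust code: `Contexts`, `build_contexts`, `is_punctuation` and the ASCII-only punctuation check are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Left and right neighbours seen for each word. Hypothetical layout,
/// loosely mirroring the `ctx.0` / `ctx.1` pair discussed in this thread.
type Contexts = HashMap<String, (Vec<String>, Vec<String>)>;

/// Assumption: a token is "punctuation" when every char is ASCII punctuation.
/// (An empty token also returns true, because `all` is vacuously true.)
fn is_punctuation(word: &str) -> bool {
    word.chars().all(|c| c.is_ascii_punctuation())
}

/// Builds co-occurrence contexts for one sentence using a sliding window
/// of at most `window` preceding words.
fn build_contexts(words: &[&str], window: usize) -> Contexts {
    let mut contexts = Contexts::new();
    let mut buffer: Vec<&str> = Vec::new();

    for &word in words {
        if is_punctuation(word) {
            // Skip the token and empty the buffer, so that words on either
            // side of the punctuation never become neighbours.
            buffer.clear();
            continue;
        }
        for &left in &buffer {
            // `left` occurred before `word` within the window.
            contexts.entry(word.to_string()).or_default().0.push(left.to_string());
            contexts.entry(left.to_string()).or_default().1.push(word.to_string());
        }
        buffer.push(word);
        if buffer.len() > window {
            // Drop the oldest word once the window is full.
            buffer.remove(0);
        }
    }
    contexts
}
```

With this sketch, calling `build_contexts(&["hello", "!", "world", "again"], 2)` produces no entry for "!" and no link between "hello" and "world"; only "world" and "again" end up as neighbours.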

@xamgore
Contributor

xamgore commented Oct 29, 2024

Hm, right. Can the candidate words have punctuation inside, like "abc!?def"? I've never dealt with unicode segmentation.

xamgore pushed a commit to xamgore/yake-rust that referenced this issue Oct 29, 2024
@bunny-therapist
Author

Based on the Python code, it skips words that are composed entirely of punctuation. So "abc!?def" would be ok, but "!?" would not.
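The rule above could be sketched as a small predicate. This is only an illustration, not the actual yake-rust or LIAAD/yake code: `is_punctuation_only` and `default_punctuation` are hypothetical names, and the ASCII punctuation set is an assumed stand-in for whatever "exclude"/"punctuation" set the library is configured with.

```rust
use std::collections::HashSet;

/// Returns true when every character of `word` is in `punctuation`.
/// Note: an empty string also returns true, because `all` is vacuously true.
fn is_punctuation_only(word: &str, punctuation: &HashSet<char>) -> bool {
    word.chars().all(|c| punctuation.contains(&c))
}

/// Hypothetical default set: all ASCII punctuation characters.
fn default_punctuation() -> HashSet<char> {
    (0u8..128)
        .map(char::from)
        .filter(|c| c.is_ascii_punctuation())
        .collect()
}
```

Under these assumptions, `is_punctuation_only("!?", &default_punctuation())` is true, while "abc!?def" is kept because it contains at least one non-punctuation character.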

xamgore pushed a commit to xamgore/yake-rust that referenced this issue Oct 29, 2024