Fix dictionary index case-sensitivity inconsistencies #121
I would rather not add workarounds for invalid dictionaries. I have already mentioned this problem. Where did you download the dictionary from? |
I downloaded mine via Arch's user repository, but I confirmed that Debian's official package repository compiles the index with the same inconsistent case. I see that your post cites the same dictionary. I used it because it's seemingly the largest English dictionary available; there really don't seem to be many options. Since If that's really the case, no pun intended, I would argue that it shouldn't be considered a workaround, just standard data munging. |
@baskerville Is there anything I can do to address any concerns you may have? If the primary If there are concerns with the code itself, let me know and I can work to address your concerns. Thanks. |
I don't know why When You could prevent the headwords from being lowercased with |
This is conjecture, but I believe Case-sensitive handling was introduced in 2007 and the default handling was lowercasing the index which is probably why it never got addressed within I understand your reluctance, but |
Being aware of the dictionary situation, I created my own version of WordNet 3.1 so that there would be at least one good English dictionary that works with Plato. |
There would be another good English dictionary that works with Plato if this code were merged, though. It works with the official Why is the solution to produce and maintain another bespoke dictionary instead of leveraging what has existed for twenty+ years? |
Don't get me wrong: I'm acknowledging the weird backward compatibility problem. I'm just looking for the most straightforward approach to solving this problem. Fortunately, the dictionaries generated with Have you found other dictd dictionaries, besides GCIDE, that aren't generated by |
Yes, the dict-moby-thesaurus package does not conform either. It maintains case for proper nouns, but does not declare I did not realize you are also involved in https://github.com/freedict/libdict, so I understand that your concerns about this may extend further than I was aware of. I really believe the most straightforward solution is to case-fold both the query and the index headword. Related to this, I discovered that the rust-caseless default case folding function does not perform any normalization. The canonical caseless matching strategy recommended in Section 3.13 of the Unicode Standard requires NFD normalization before and after case folding of a given word, while noting that NFD normalization after case folding is sufficient to handle most cases. However, Rust's unicode-normalization crate was recently shown to be 2-25x slower than ICU, so maybe we don't want to deal with this at all for now.
|
Hi,
I pulled the GCIDE dictionary into Plato and noticed that searches were failing because of case sensitivity. The cause is that the GCIDE dictionary index does not case-fold its headwords. Looking at some other dictionaries, this seems to be handled inconsistently.
So this PR provides the following fixes (and tests):
Handling of dictionary index parsing from three possible states:
Case folding (accounting for non-Latin characters) of the dictionary-side query and of the headwords when the index is being created within Plato
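To illustrate the idea of folding both sides, here is a minimal sketch and not the code in this PR. It assumes the index is available as `(headword, offset, size)` tuples, and it uses the standard library's `str::to_lowercase` as a stand-in for full Unicode case folding. That stand-in is an approximation: it performs no normalization and does not, for example, match `ß` against `SS`. The names `fold`, `IndexEntry`, `build_index` and `lookup` are hypothetical.

```rust
use std::collections::BTreeMap;

/// One index record, keeping the headword exactly as the dictionary spelled it.
#[derive(Debug, Clone, PartialEq)]
struct IndexEntry {
    headword: String,
    offset: u64,
    size: u64,
}

/// Approximate case folding using only the standard library.
/// Assumption: lowercasing stands in for real Unicode case folding here.
fn fold(word: &str) -> String {
    word.to_lowercase()
}

/// Build an index keyed by the folded headword. Headwords that differ only
/// by case ("Paris" / "paris") end up under the same key.
fn build_index(raw: &[(&str, u64, u64)]) -> BTreeMap<String, Vec<IndexEntry>> {
    let mut index: BTreeMap<String, Vec<IndexEntry>> = BTreeMap::new();
    for &(headword, offset, size) in raw {
        index.entry(fold(headword)).or_default().push(IndexEntry {
            headword: headword.to_string(),
            offset,
            size,
        });
    }
    index
}

/// Fold the query the same way as the index keys before looking it up.
fn lookup<'a>(index: &'a BTreeMap<String, Vec<IndexEntry>>, query: &str) -> &'a [IndexEntry] {
    index.get(&fold(query)).map(Vec::as_slice).unwrap_or(&[])
}
```

Because the original spelling is stored in each entry, a caller can still show or rank exact-case matches first while the lookup itself ignores case.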
Tested via emulator and on my Forma: