Add some more test cases for tokenization and ascii folding #501

Open. Wants to merge 2 commits into base: master
@SiarheiFedartsou commented Dec 7, 2024

👋 I did some awesome work for the Pelias project and would love for everyone to have a look at it and provide feedback.


Here's the reason for this change 🚀

Some questions came up in the discussion of PR #498, so I'd like to propose extending the existing test harness a bit to document the current state of things. The current schema:

  • tokenizes on whitespace, hyphens, and slashes (but not on the "dash" character, for example; I'd expect the ICU tokenizer to work differently here, by the way 🤔)
  • normalizes Thai digits to Arabic ones (I think we can extrapolate this to any other digit-writing system)
  • removes tonal marks in Thai script
  • never splits digits "glued" to the end of a word into a separate token
  • does not properly tokenize Asian languages that do not use whitespace
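
To make the list above concrete, here is a rough pure-Python sketch of the described behavior. It is not the Pelias schema and not Elasticsearch's ICU analysis. `toy_analyze`, `normalize_char`, the separator set, and the tone-mark range are assumptions made for illustration only, and "slashes" is assumed to cover both `/` and `\`.

```python
import re
import unicodedata
from typing import List

# Thai tone marks MAI EK, MAI THO, MAI TRI, MAI CHATTAWA (U+0E48..U+0E4B).
# Assumption: these are the "tonal marks" the description refers to.
THAI_TONE_MARKS = {chr(cp) for cp in range(0x0E48, 0x0E4C)}

# Assumed separators: whitespace, hyphen-minus, forward slash, backslash.
# Unicode dashes such as EN DASH (U+2013) are deliberately left out.
SEPARATORS = re.compile(r"[\s\-/\\]+")


def normalize_char(ch: str) -> str:
    """Drop Thai tone marks and map any Unicode decimal digit to ASCII."""
    if ch in THAI_TONE_MARKS:
        return ""
    digit = unicodedata.decimal(ch, None)
    return str(digit) if digit is not None else ch


def toy_analyze(text: str) -> List[str]:
    """Hypothetical stand-in for the analyzer: normalize, then split."""
    normalized = "".join(normalize_char(ch) for ch in text)
    return [token for token in SEPARATORS.split(normalized) if token]


if __name__ == "__main__":
    for sample in ["foo-bar/baz", "foo\u2013bar", "\u0e51\u0e52\u0e53",
                   "street12", "\u6771\u4eac\u90fd\u5e81"]:
        print(repr(sample), toy_analyze(sample))
```

Because the sketch only ever splits on the assumed separators, trailing digits stay attached to their word and unspaced scripts come back as a single token, which mirrors the last two bullets.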

Here's what actually got changed 👏

  • Added tests

Here's how others can test the changes 👀

Run tests :)
