Add some more test cases for tokenization and ascii folding #501

SiarheiFedartsou · 2024-12-07T12:58:25Z

👋 I did some awesome work for the Pelias project and would love for everyone to have a look at it and provide feedback.

Here's the reason for this change 🚀

The discussion of PR #498 raised some questions, so I'd like to extend the existing test harness a bit to document the current state of things. The current schema:

tokenizes on whitespace, hyphens, and slashes (but not on the "dash" character, for example; I'd expect the ICU tokenizer to behave differently here, by the way 🤔)
normalizes Thai digits to Arabic ones (I think we can extrapolate this to digits in any other writing system)
removes tonal marks in Thai script
never splits digits "glued" to the end of a word into a separate token
does not properly tokenize Asian languages that don't use whitespace
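
To make the list above concrete, here is a rough sketch in Python that mimics the described behaviour. It is not the Pelias schema itself (the real analysis runs inside Elasticsearch); the function names `fold_char` and `analyze` are hypothetical, and limiting mark removal to the four Thai tone marks U+0E48 to U+0E4B is an assumption made for illustration.

```python
import re
import unicodedata
from typing import List

# Illustrative stand-in only: the real analyzer lives in the Elasticsearch
# schema. All names and rules below are assumptions based on the list above.

# Split on whitespace, ASCII hyphen, and slash; other dash characters
# (for example U+2013 EN DASH) are deliberately NOT separators.
SPLIT_PATTERN = re.compile(r"[\s\-/]+")

# Assumption: "tonal marks" means the four Thai tone marks U+0E48..U+0E4B.
THAI_TONE_MARKS = {"\u0e48", "\u0e49", "\u0e4a", "\u0e4b"}


def fold_char(ch: str) -> str:
    """Map any Unicode decimal digit to 0-9 and drop Thai tone marks."""
    if unicodedata.category(ch) == "Nd":
        return str(unicodedata.decimal(ch))
    if ch in THAI_TONE_MARKS:
        return ""
    return ch


def analyze(text: str) -> List[str]:
    """Tokenize on the separators above, then fold each token."""
    tokens = [t for t in SPLIT_PATTERN.split(text) if t]
    return ["".join(fold_char(c) for c in t) for t in tokens]


if __name__ == "__main__":
    print(analyze("main-street 10/12"))
    print(analyze("street22"))
```

In this sketch, `street22` stays a single token because digits are never treated as a boundary, and a string with no whitespace, hyphen, or slash (as is common in Chinese, Japanese, or Thai text) comes back as one token.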

Here's what actually got changed 👏

Added tests

Here's how others can test the changes 👀

Run tests :)

SiarheiFedartsou added 2 commits December 7, 2024 13:51

Add some more test cases for tokenization and ascii folding

5c90483

Add some more test cases for tokenization and ascii folding

f038644
