Weird discrepancy with ICU4X #105

aumetra · 2024-10-09T13:19:20Z

To set the scene: I have a proptest set up that compares two libraries. One of them uses unicode-normalization under the hood, the other icu_normalizer.

I expected both to produce the same output, but my CI failed at some point on the odd string "\u{11366}\u{113ce}".
When it is put through each NFC normalizer, you get two different outputs:

unicode-normalization: "\u{113ce}\u{11366}"
icu_normalizer: "\u{11366}\u{113ce}"

Just a fun little thing I thought I'd report since it's technically a correctness issue (I'm just not good enough with Unicode to determine whether it's an issue with ICU4X or this crate).
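For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of the differential check, using only the standard library. The names `escaped` and `diff_normalizers` are hypothetical, and the two normalizers are passed in as closures, so nothing here assumes a particular API of unicode-normalization or icu_normalizer.

```rust
/// Render a string as `\u{...}` escapes so combining marks and
/// unassigned code points stay readable in a failure report.
fn escaped(s: &str) -> String {
    s.chars().flat_map(char::escape_unicode).collect()
}

/// Run the same input through two normalizers (hypothetical closures standing
/// in for the real libraries). Returns `None` if they agree, otherwise both
/// outputs in escaped form.
fn diff_normalizers<A, B>(input: &str, a: A, b: B) -> Option<(String, String)>
where
    A: Fn(&str) -> String,
    B: Fn(&str) -> String,
{
    let left = a(input);
    let right = b(input);
    if left == right {
        None
    } else {
        Some((escaped(&left), escaped(&right)))
    }
}
```

In a property test, the two closures would wrap the real NFC calls of each library, and a `Some(..)` result would be the failure to report.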

aumetra · 2024-10-09T13:33:43Z

One more. I don't know why proptest is suddenly finding so many:

Original: "\u{113c2}\u{113b8}"
unicode-normalization: "\u{113c7}"
icu_normalizer: "\u{113c2}\u{113b8}"

Manishearth · 2024-10-09T22:07:43Z

This crate hasn't been updated to Unicode 16.0 yet. Doing so is not super straightforward this time due to some of the newer characters having interesting combinations of properties.
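To illustrate how a difference in Unicode data version alone can change normalizer output, here is a rough sketch of the canonical reordering step (stable sort of each run of non-starters by Canonical_Combining_Class). The two lookup functions `ccc_older` and `ccc_newer` are hypothetical, and the class values in them are assumed for illustration rather than copied from the Unicode Character Database. The sketch does not say which library uses which data.

```rust
/// Canonical reordering: within each maximal run of characters whose
/// combining class is non-zero, stably sort by combining class.
/// `ccc` is a caller-supplied lookup (hypothetical, not a real crate API).
fn canonical_reorder(input: &str, ccc: impl Fn(char) -> u8) -> String {
    let mut chars: Vec<char> = input.chars().collect();
    let mut start = 0;
    while start < chars.len() {
        if ccc(chars[start]) == 0 {
            // Starters are never moved.
            start += 1;
            continue;
        }
        // Find the end of this run of non-starters.
        let mut end = start;
        while end < chars.len() && ccc(chars[end]) != 0 {
            end += 1;
        }
        // `sort_by_key` is stable, so equal classes keep their order.
        chars[start..end].sort_by_key(|&c| ccc(c));
        start = end;
    }
    chars.into_iter().collect()
}

/// Assumed table for data that predates U+113CE: unknown characters get class 0.
fn ccc_older(c: char) -> u8 {
    match c {
        '\u{11366}' => 230, // assumed value, for illustration
        _ => 0,
    }
}

/// Assumed table for data that knows U+113CE as a low-class combining mark.
fn ccc_newer(c: char) -> u8 {
    match c {
        '\u{11366}' => 230, // assumed value, for illustration
        '\u{113ce}' => 9,   // assumed value, for illustration
        _ => 0,
    }
}
```

Under these assumed tables, `canonical_reorder("\u{11366}\u{113ce}", ccc_older)` leaves the input unchanged, while the `ccc_newer` variant swaps the two code points, which is the same shape of difference as in the first report.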

aumetra · 2024-10-09T22:36:03Z

Ah, interesting. Thanks for the info! Good to know this is due to a new revision of the standard and not a bug.
