feat(common/models): add Trie string-encoding + decoding methods 💾 #11088
Conversation
User Test Results: User tests are not required.
// Offsetting by even just 0x0020 avoids control-code chars + avoids VS Code not liking the encoding.
const ENCODED_NUM_BASE = 0x0000;
const SINGLE_CHAR_RANGE = Math.pow(2, 16) - ENCODED_NUM_BASE;
A notable differentiation from the pseudo-spec established in #10336. It does restrict the data ranges slightly, but it makes a big, positive difference in the encoding and how IDEs interpret the resulting encoded Trie data when written to files.
Note: the unit tests established later currently do not adjust for alternate ENCODED_NUM_BASE values. This shouldn't be too tricky to establish for reasonable value selections, though.
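To make the offset concrete, here is a minimal sketch of fixed-width number encoding with a base offset, using the 0x0020 value discussed above. The function names mirror the PR, but the bodies here are illustrative assumptions, not the actual implementation.

```typescript
// Hypothetical sketch; ENCODED_NUM_BASE = 0x0020 skips the control-code range.
const ENCODED_NUM_BASE = 0x0020;
const SINGLE_CHAR_RANGE = Math.pow(2, 16) - ENCODED_NUM_BASE;

// Encodes `num` as `width` UTF-16 code units, most-significant digit first.
function compressNumber(num: number, width: number): string {
  let result = '';
  for (let i = 0; i < width; i++) {
    result = String.fromCharCode((num % SINGLE_CHAR_RANGE) + ENCODED_NUM_BASE) + result;
    num = Math.floor(num / SINGLE_CHAR_RANGE);
  }
  return result;
}

// Inverse of compressNumber over str[start, end).
function decompressNumber(str: string, start: number, end: number): number {
  let num = 0;
  for (let i = start; i < end; i++) {
    num = num * SINGLE_CHAR_RANGE + (str.charCodeAt(i) - ENCODED_NUM_BASE);
  }
  return num;
}
```

With a nonzero base, every emitted code unit is at or above 0x0020, so no control characters ever appear in the encoded string.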
}`;

const compression = {
  // Achieves FAR better compression than JSON.stringify, which \u-escapes most chars.
Unless using ENCODED_NUM_BASE=0x0020 or similar. JSON.stringify likes to \u-escape control characters, which leads to notable string-bloat when the control characters are utilized - they tend to represent values that appear with high frequency in the encoding.
> JSON.stringify(String.fromCharCode(1))
'"\\u0001"'
With ENCODED_NUM_BASE=0 ...
- note that most leaf nodes will have low, single-digit .entries counts, all of which would be represented by control codes and thus subject to \u-escaping. The word lengths will usually be notably less than 32 chars and thus would also be subject to the same effects... leading to most encoded entries having length less than 32 chars.
- also, most near-leaf internal nodes will have but a few legal values leading to child nodes, once again using control codes for their representation.
JSON.stringify use is much prettier and more straightforward with ENCODED_NUM_BASE=0x0020 or above, as this bypasses the control-code range with one exception: 0x007f (DEL).
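The size penalty is easy to verify directly; this small check contrasts a control-range character with one at or above 0x0020:

```typescript
// A control-range character gets \u-escaped: '"\u0001"' is 8 chars.
const low = String.fromCharCode(1);
console.log(JSON.stringify(low).length);  // 8

// A character at or above 0x0020 passes through untouched: '"!"' is 3 chars.
const high = String.fromCharCode(0x21);
console.log(JSON.stringify(high).length); // 3
```

A single escaped control character costs six code units where an unescaped character costs one, which is why frequent control-range values bloat the serialized form so badly.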
…ssion Aims for a UCS-2 encoded string and does not shy away from unpaired surrogates in the encoding.
… code When a leaf node exists at the same Trie location as an internal node, it should be a child of that internal node using SENTINEL_CODE_UNIT (\ufdd0). The fixture was using null/undefined instead!
… for compressed-Trie code
Force-pushed from b40b7c7 to 467efd4 (Compare)
No, I think we just ensure that the data structure is versioned. Then we can change format if we need to by bumping the version.
This would mean that future new 'versions' are assumed 'unreadable' by older versions. A bit of work would avoid that issue, letting old-version-compatible code read the parts of the data structure it recognizes while ignoring the parts it hasn't yet seen. We're already talking about a breaking-change format moving to 18.0, where 17.0 and before won't be able to load models with a Trie-encoding format based on these changes. I would prefer to minimize the risk of causing that again. That said, we can always leave it as a future work item, linking my thoughts in that direction above as a note for implementation before the 18.0 release.
This is premature design. Do you have a specific use case that you are designing for where you can demonstrate that this will be sufficient to support that feature? If not, then reserving the space may not end up being adequate anyway, and we may still need a newer version of Keyman to support the new functionality -- and we've done this premature design for nothing. We had to play these games in the past when upgrading was more difficult. The less we have to do it going forward, the better. Keyman is becoming increasingly evergreen, and reserving space for an unknown future use is not the right thing to do -- either in terms of file formats, or in terms of APIs, or in internal design. Design for use now. Refactor when needed. Have good version management -- and that is sufficient.
Example 1: localized emoji-oriented models

Noting the structure of the package I referenced for #905, it actually includes GitHub- and Slack-style emoji code input. To be clear, that package also supports keywords associated with the emojis - the prefix of any associated keyword will suggest the associated emoji. We'd likely have multiple entries per emoji - one per keyword for the emoji + its full-length standard tag - all of which should also provide the associated emoji... or the English tag for that emoji. Either version - the emoji codepoint or the English tag - is not closely associated with the key or the localized emoji tag - and will likely motivate an extra field. If desired, we should be able to create Crowdin localization files based on the backing data's associated JSON structure and programmatically construct them if using the "English tag" variant. (That "English tag" would serve as the string's localization key.) This would allow progressive localization of emojis as a group effort by each language community, in theory, as a potential approach.

Example 2: Agglutinative models

In agglutinative models, we'll likely want to store affixes in tries for lookup. Granted, it's unclear if we'll want one Trie per affix type or all affixes in the same Trie... but if it's the latter, we may wish to 'tag' the affix's type with its entry. Even if not that, I imagine word roots will have some sort of list of legal associated affixes - and that would likely need to be tagged onto each root's Trie entry.
You say this, but we've taken great pains to maintain backward compatibility for keyboards. Last I checked, we still actively support KMW 1.0 and 2.0 keyboards! Why are models being treated differently?
I guess I'm probably conflating reuse of the same code paths, with extensions for specific versions, with a desire for backward compatibility. Could always have a "new version" later that uses the same functions where they apply, then adds extra sections to work with the new data. What I'm really hoping for with that: to avoid the need for major variations in the encoding patterns as we make more specific versions in the future. Not so much for the 18.0 release to actually use extended data from Trie versions that don't yet exist.
That's backward compatibility, not forward compatibility.
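For reference, the "read the parts you recognize, skip the parts you don't" idea discussed above is commonly realized with length-prefixed (tag/length/value) sections. This is a hypothetical sketch, not the PR's format; the tag and length layout here is an assumption.

```typescript
// Hypothetical forward-compatible section layout: each section carries a tag
// and a payload length, so an older reader can skip tags it doesn't know.
interface Section { tag: number; payload: string; }

function writeSections(sections: Section[]): string {
  return sections
    .map(s => String.fromCharCode(s.tag) + String.fromCharCode(s.payload.length) + s.payload)
    .join('');
}

function readKnownSections(data: string, knownTags: Set<number>): Section[] {
  const out: Section[] = [];
  let i = 0;
  while (i < data.length) {
    const tag = data.charCodeAt(i);
    const len = data.charCodeAt(i + 1);
    if (knownTags.has(tag)) {
      out.push({ tag, payload: data.substring(i + 2, i + 2 + len) });
    } // unknown tags are skipped rather than treated as errors
    i += 2 + len;
  }
  return out;
}
```

An old-version reader constructed this way keeps working when a newer writer appends sections it has never seen, which is the forward-compatibility property under discussion.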
  return compressed;
}

const ENTRY_HEADER_WIDTH = NODE_SIZE_WIDTH + WEIGHT_WIDTH;
What's NODE_SIZE_WIDTH? ENTRYLEN_WIDTH?
Doc-comments have been added.
const weight = decompressNumber(str, baseIndex + NODE_SIZE_WIDTH, baseIndex + NODE_SIZE_WIDTH + WEIGHT_WIDTH);

// Assumes string-subsection size check has passed.
const childCnt = decompressNumber(str, baseIndex + NODE_TYPE_INDEX, baseIndex + NODE_TYPE_INDEX + 1);
Where do we call compressNumber for the count of children?
Line 211 for internal nodes.
const valueCntAndType = values.length;
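The name valueCntAndType suggests a count and a node-type flag sharing one field. One common way to do that is bit-packing within a single UTF-16 code unit; this sketch assumes a high-bit type flag, which is an illustrative guess rather than the PR's actual layout.

```typescript
// Hypothetical bit layout: high bit flags an internal node, remaining 15 bits
// hold the child/value count. The exact layout is an assumption.
const TYPE_FLAG = 0x8000;

function packCountAndType(count: number, isInternal: boolean): number {
  if (count >= TYPE_FLAG) {
    throw new RangeError('count too large for packed field');
  }
  return count | (isInternal ? TYPE_FLAG : 0);
}

function unpackCount(packed: number): number {
  return packed & (TYPE_FLAG - 1); // mask off the type bit
}

function unpackIsInternal(packed: number): boolean {
  return (packed & TYPE_FLAG) !== 0;
}
```

Packing both values into one code unit keeps the per-node header at a single character, at the cost of halving the representable count range.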
    // Requires full Trie integration, which isn't included yet.
    });
  });
});
Do we need a test for decompressing \ufdd0?
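A test along those lines could start from the properties that make U+FDD0 a safe sentinel. This is a hypothetical test sketch, not part of the PR's suite; SENTINEL_CODE_UNIT matches the PR, but the checks are illustrative.

```typescript
// U+FDD0 is a Unicode noncharacter: valid in strings, never assigned to text,
// so it cannot collide with real key data.
const SENTINEL_CODE_UNIT = '\ufdd0';

// It fits in a single UTF-16 code unit and is not a surrogate, so it is safe
// in a UCS-2-oriented encoding.
if (SENTINEL_CODE_UNIT.length !== 1) throw new Error('expected one code unit');
const code = SENTINEL_CODE_UNIT.charCodeAt(0);
if (code >= 0xD800 && code <= 0xDFFF) throw new Error('sentinel must not be a surrogate');

// JSON.stringify leaves it unescaped ('"\ufdd0"' is 3 chars), so it stays
// compact even if the encoded string is serialized to JSON.
if (JSON.stringify(SENTINEL_CODE_UNIT).length !== 3) throw new Error('sentinel was escaped');
```

A full decompression test would additionally assert that a leaf stored under this key is recovered as the child of its internal node, per the fixture fix in this PR.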
Co-authored-by: Eberhard Beilharz <[email protected]>
This PR aims to help address #10336, though it is not sufficient on its own for a full fix.
So far, this PR establishes methods useful to "compress" / encode our lexical model Tries into a notably more compact format and to "decompress" / decode them from that format, one piece at a time.
It does not currently:
So, there's obviously more work that would be needed, but it's a solid start in the right direction.
Possibly worth considering: do we build in some reserved ranges now in case we need future extensibility? See #11088 (comment).
@keymanapp-test-bot skip