Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(common/models): add Trie string-encoding + decoding methods 💾 #11088

Merged

Conversation

jahorton
Copy link
Contributor

@jahorton jahorton commented Mar 27, 2024

This PR aims to help address #10336, though not sufficient for a full fix.

So far, this PR establishes methods useful to "compress" / encode our lexical model Tries into a notably more compact format and to "decompress" / decode them from that format, one piece at a time.

It does not currently:

  • link these methods into our existing Trie definitions
  • use them at run-time
  • use them when compiling lexical models

So, there's obviously more work that would be needed, but it's a solid start in the right direction.


Possibly worth considering: do we build in some reserved ranges now in case we need future extensibility? See #11088 (comment).

@keymanapp-test-bot skip

@jahorton jahorton added this to the 18.0 milestone Mar 27, 2024
@keymanapp-test-bot keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Mar 27, 2024
@keymanapp-test-bot
Copy link

keymanapp-test-bot bot commented Mar 27, 2024

User Test Results

Test specification and instructions

User tests are not required

Test Artifacts

Comment on lines 6 to 8
// Offsetting by even just 0x0020 avoids control-code chars + avoids VS Code not liking the encoding.
const ENCODED_NUM_BASE = 0x0000;
const SINGLE_CHAR_RANGE = Math.pow(2, 16) - ENCODED_NUM_BASE;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A notable differentiation from the pseudo-spec established in #10336. It does restrict the data ranges slightly, but it makes a big, positive difference in the encoding and how IDEs interpret the resulting encoded Trie data when written to files.

Note: the unit tests established later currently do not adjust for alternate ENCODED_NUM_BASE values. This shouldn't be too tricky to establish for reasonable value selections, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fixed version of the fixture, utilizing #11074. My issues using this fixture during development of this PR were the cause of #11073's discovery.

}`;

const compression = {
// Achieves FAR better compression than JSON.stringify, which \u-escapes most chars.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless using ENCODED_NUM_BASE=0x0020 or similar. JSON.stringify likes to \u-escape control characters, which leads to notable string-bloat when the control characters are utilized - they tend to represent values that appear with high frequency in the encoding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> JSON.stringify(String.fromCharCode(1))
'"\\u0001"'

With ENCODED_NUM_BASE=0...

  • note that most leaf notes will have low, single-digit .entries counts, all of which would be represented by control codes and thus subject to \u-escaping. The word lengths will usually be notably less than 32 chars and thus would also subject to the same effects... leading to most encoded entries having length less than 32 chars.

  • also, most near-leaf internal nodes will have but a few legal values leading to child nodes, once again using control codes for their representation.

JSON.stringify use is much prettier and straightforward with ENCODED_NUM_BASE=0x0020 or above, as this bypasses the control-code range with one exception: 0x007f (DEL).

@jahorton jahorton changed the base branch from master to change/common/models/templates/trie-results-through-traversal July 2, 2024 04:25
@jahorton jahorton force-pushed the feat/common/models/templates/trie-compression-start branch from b40b7c7 to 467efd4 Compare July 2, 2024 04:25
Base automatically changed from change/common/models/templates/trie-results-through-traversal to master July 8, 2024 07:37
@jahorton jahorton changed the base branch from epic/user-dict to chore/models/seed-epic-model-encoding August 23, 2024 05:01
@jahorton jahorton changed the title feat(common/models): add Trie string-encoding + decoding methods 💾 📖 feat(common/models): add Trie string-encoding + decoding methods 💾 Aug 23, 2024
@mcdurdin mcdurdin modified the milestones: 18.0, A18S9 Aug 26, 2024
@mcdurdin
Copy link
Member

do we build in some reserved ranges now in case we need future extensibility?

No, I think we just ensure that the data structure is versioned. Then we can change format if we need to by bumping the version.

@jahorton
Copy link
Contributor Author

jahorton commented Aug 26, 2024

do we build in some reserved ranges now in case we need future extensibility?

No, I think we just ensure that the data structure is versioned. Then we can change format if we need to by bumping the version.

This would mean that future new 'versions' are assumed 'unreadable' by older versions. A bit of work would avoid that issue, letting old-version compatible code to read the parts of the data structure it recognizes while ignoring the parts it hasn't yet seen.

We're already talking about a breaking-change format moving to 18.0, where 17.0 and before won't be able to load models with a Trie-encoding format based on these changes. I would prefer to minimize the risk of causing that again.

That said, we can always leave it as a future work item, linking my thoughts in that direction above as a note for implementation before 18.0 release.

@mcdurdin
Copy link
Member

This would mean that future new 'versions' are assumed 'unreadable' by older versions. A bit of work would avoid that issue, letting old-version compatible code to read the parts of the data structure it recognizes while ignoring the parts it hasn't yet seen.

We're already talking about a breaking-change format moving to 18.0, where 17.0 and before won't be able to load models with a Trie-encoding format based on these changes. I would prefer to minimize the risk of causing that again.

This is premature design. Do you have a specific use case that you are designing for where you can demonstrate that this will be sufficient to support that feature? If not, then reserving the space may not end up being adequate anyway, and we may still need a newer version of Keyman to support the new functionality -- and we've done this premature design for nothing.

We had to play these games in the past when upgrading was more difficult. The less we have to do it going forward, the better.

Keyman is becoming increasingly evergreen, and reserving space for an unknown future use is not the right thing to do -- either in terms of file formats, or in terms of APIs, or in internal design. Design for use now. Refactor when needed. Have good version management -- and that is sufficient.

@jahorton
Copy link
Contributor Author

jahorton commented Aug 26, 2024

This is premature design. Do you have a specific use case that you are designing for where you can demonstrate that this will be sufficient to support that feature? If not, then reserving the space may not end up being adequate anyway, and we may still need a newer version of Keyman to support the new functionality -- and we've done this premature design for nothing.

We had to play these games in the past when upgrading was more difficult. The less we have to do it going forward, the better.

Example 1: localized emoji-oriented models

Noting the structure of the package I referenced for #905, they actually include use of GitHub and Slack style emoji code input: :emoji-name-here:. That works well... for English. These would probably be localized for different languages, and likely should be possible to look up based on their localized name. This is also separate from their actual encoded emoji character, the primary associated output.

To be clear, that package also supports keywords associated with the emojis - the prefix of any associated keyword will suggest the associated emoji. We'd likely have multiple entries per emoji - one per keyword for the emoji + its full-length standard tag - all of which should also provide the associated emoji... or the English tag for that emoji. Either version - the emoji codepoint or the English tag - is not closely associated with the key or the localized emoji tag - and will likely motivate an extra field.

If desired, we should be able to create Crowdin localization files based on the backing data's associated JSON structure and programmatically construct them if using the "English tag" variant. (That "English tag" would serve as the string's localization key.) This would allow progressive localization of emojis as a group effort by each language community, in theory, as a potential approach.

Example 2: Agglutinative models

In agglutinative models, we'll likely want to store affixes in tries for lookup. Granted, it's unclear if we'll want one Trie per affix type or all affixes in the same Trie... but if it's the latter, we may wish to 'tag' the affix's type with its entry.

Even if not that, I imagine word roots will have some sort of list of legal associated affixes - and that would likely need to be tagged onto each root's Trie entry.

Keyman is becoming increasingly evergreen, and reserving space for an unknown future use is not the right thing to do -- either in terms of file formats, or in terms of APIs, or in internal design. Design for use now. Refactor when needed. Have good version management -- and that is sufficient.

You say this, but we've taken great pains to maintain backward compatibility for keyboards. Last I checked, we still actively support KMW 1.0 and 2.0 keyboards! Why are models being treated differently?

Base automatically changed from chore/models/seed-epic-model-encoding to epic/model-encoding August 27, 2024 03:06
@jahorton
Copy link
Contributor Author

jahorton commented Aug 27, 2024

I guess I'm probably conflating reuse of the same code paths, with extensions for specific versions, with a desire for backward compatibility. Could always have a "new version" later that uses the same functions where they apply, then adds extra sections to work with the new data.

What I'm really hoping for with that: to avoid the need for major variations in the encoding patterns as we make more specific versions in the future. Not so much for the 18.0 release to actually use extended data from Trie versions that don't yet exist.

@mcdurdin
Copy link
Member

You say this, but we've taken great pains to maintain backward compatibility for keyboards. Last I checked, we still actively support KMW 1.0 and 2.0 keyboards! Why are models being treated differently?

That's backward compatibility, not forward compatibility

common/models/templates/src/trie-compression.ts Outdated Show resolved Hide resolved
common/models/templates/src/trie-compression.ts Outdated Show resolved Hide resolved
return compressed;
}

const ENTRY_HEADER_WIDTH = NODE_SIZE_WIDTH + WEIGHT_WIDTH;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's NODE_SIZE_WIDTH? ENTRYLEN_WIDTH?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc-comments have been added.

common/models/templates/src/trie-compression.ts Outdated Show resolved Hide resolved
common/models/templates/src/trie-compression.ts Outdated Show resolved Hide resolved
const weight = decompressNumber(str, baseIndex + NODE_SIZE_WIDTH, baseIndex + NODE_SIZE_WIDTH + WEIGHT_WIDTH);

// Assumes string-subsection size check has passed.
const childCnt = decompressNumber(str, baseIndex + NODE_TYPE_INDEX, baseIndex + NODE_TYPE_INDEX + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we call compressNumber for the count of children?

Copy link
Contributor Author

@jahorton jahorton Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 195 211 for internal nodes.

const valueCntAndType = values.length;

common/models/templates/test/test-trie-compression.js Outdated Show resolved Hide resolved
// Requires full Trie integration, which isn't included yet.
});
});
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a test for decompressing \ufdd0?

@jahorton jahorton merged commit 2093ef9 into epic/model-encoding Aug 28, 2024
16 of 17 checks passed
@jahorton jahorton deleted the feat/common/models/templates/trie-compression-start branch August 28, 2024 07:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants