Tiny Language Model (TLM)

Is it possible to use an Abstract Syntax Tree (AST), or an AST-like structure, besides regex in the Byte Pair Encoding (BPE) algorithm?
Is it possible to develop a character-based tiny language model (TLM)?

Could it be based on the following method, which has some similarities with n-grams?

The quick brown fox jumps over the lazy dog.

⬇️ ⬆️

the, quick, brown, fox, jumps, over, the, lazy, dog --> word 9-gram or 9-wgram

⬇️ ⬆️

t, h, e = t + h + e = the --> char 3-gram or word 1-gram (3-cgram or 1-wgram)

q, u, i, c, k = q + u + i + c + k = quick --> char 5-gram or word 1-gram (5-cgram or 1-wgram)

...

⬇️ ⬆️

a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z --> char 1-gram or 1-cgram
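
The decomposition above can be sketched in a few lines of Python. This is only an illustrative sketch, not an implementation of BPE or of an AST-based tokenizer: the helper names (`split_words`, `label_word`, `alphabet_of`) are hypothetical, and lowercasing plus dropping punctuation is an assumption taken from how the example sentence is written out above.

```python
def split_words(sentence):
    """Lowercase the sentence and split it into words, dropping punctuation (assumed)."""
    cleaned = "".join(ch for ch in sentence.lower() if ch.isalpha() or ch.isspace())
    return cleaned.split()


def label_word(word):
    """Describe one word as a char n-gram (n-cgram) and a word 1-gram (1-wgram)."""
    return f"{', '.join(word)} --> {len(word)}-cgram or 1-wgram"


def alphabet_of(words):
    """Collect the distinct characters (1-cgrams) that the words are built from."""
    return sorted(set("".join(words)))


sentence = "The quick brown fox jumps over the lazy dog."

# Going down: sentence -> words -> characters -> alphabet.
words = split_words(sentence)
print(f"{', '.join(words)} --> {len(words)}-wgram")
for word in words:
    print(label_word(word))
chars = alphabet_of(words)
print(f"{', '.join(chars)} --> 1-cgram ({len(chars)} distinct characters)")

# Going up: characters join back into words, and words into the sentence.
rebuilt = " ".join("".join(list(word)) for word in words)
print(rebuilt)
```

Each level keeps enough information to rebuild the level above it, which is what the two arrows between the levels express.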
