laurislopata changed the title from "Huggingface already has an efficient implementation of this" to "Huggingface already has an efficient implementation of this?" on Mar 19, 2024
Hugging Face's tokenizers don't support all non-English languages.
I'm also under the impression that HF already supports training a BPE tokenizer, but I'm relatively new to this; could you elaborate? I thought any text could be fed into their tokenizers and it would just work?
I'm not sure; I think I need to implement a BPE tokenizer from scratch to keep it easy to use... you may like Karpathy's minBPE.
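For context, the core of a from-scratch BPE trainer (in the spirit of minBPE, though this is only a minimal sketch, not Karpathy's actual code) is short: count adjacent token pairs, repeatedly merge the most frequent pair into a new token id. The function and variable names below are illustrative, not from any library.

```python
from collections import Counter

def get_pair_counts(ids):
    # Count occurrences of each adjacent pair of token ids.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every non-overlapping occurrence of `pair` with `new_id`,
    # scanning greedily left to right.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    # Start from raw UTF-8 bytes (ids 0..255); each merge mints a new id.
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids

merges, ids = train_bpe("aaabdaaabac", 2)
print(merges)  # learned merge rules, e.g. (97, 97) -> 256
print(ids)     # compressed token sequence
```

Working on raw bytes is what makes this language-agnostic: no pre-tokenization or whitespace assumptions, which is exactly the concern raised above about non-English text.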
When Karpathy claimed that an efficient implementation of a BPE trainer doesn't exist, I did some research and found this in Hugging Face's tokenizers: https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/bpe/trainer.rs
Isn't this exactly what Karpathy was creating?