This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

Wrong tokenizer used for OpenAI embeddings #31

Open
darknoon opened this issue Feb 18, 2023 · 1 comment

Comments

@darknoon

I was looking through the OpenAI code and noticed that the wrong tokenizer is used for newer models like text-embedding-ada-002. These models use the cl100k_base encoding, which is implemented by tiktoken.

There is a list of encodings here for their public models.
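
As a rough sketch of what picking the encoding per model could look like: the table below is partial and illustrative, and `MODEL_TO_ENCODING` and `encodingForModel` are hypothetical names, not code from this repository. The entries should be checked against the upstream list before relying on them.

```typescript
// Encoding names used by tiktoken for OpenAI's public models.
type EncodingName = "r50k_base" | "p50k_base" | "cl100k_base";

// Partial, illustrative model-to-encoding table (hypothetical name).
// Verify against the upstream encoding list; it is not exhaustive.
const MODEL_TO_ENCODING = new Map<string, EncodingName>([
  ["text-embedding-ada-002", "cl100k_base"],
  ["text-davinci-003", "p50k_base"],
  ["text-davinci-002", "p50k_base"],
  ["davinci", "r50k_base"],
]);

// Hypothetical helper: look up the encoding for a model and fail loudly
// for unknown models instead of silently falling back to an older tokenizer.
function encodingForModel(model: string): EncodingName {
  const encoding = MODEL_TO_ENCODING.get(model);
  if (encoding === undefined) {
    throw new Error(`No known encoding for model "${model}"`);
  }
  return encoding;
}
```

Throwing on an unknown model is a deliberate choice in this sketch: a silent default would likely reproduce the bug described here, where token counts come from the wrong tokenizer.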

I'm currently looking at making a WASM build of tiktoken, though I think a pure JS approach would also work fine.

@cfortuner
Owner

This might work -> https://www.npmjs.com/package/@dqbd/tiktoken @darknoon

Let me know
