Train BasicTokenizer on GPU with PyTorch, 100x speedup #38

Open · wants to merge 34 commits into base: master
Conversation

@kuprel commented Feb 22, 2024

The following files are added:

  • minbpe/torch/base.py
    • Contains merge_torch
  • minbpe/torch/basic.py
    • Contains BasicTokenizerTorch, overrides the train and encode methods of BasicTokenizer
  • minbpe/torch/regex.py
    • Contains RegexTokenizerTorch, overrides the encode_ordinary method of RegexTokenizer
  • minbpe/torch/gpt4.py
    • Contains GPT4TokenizerTorch, mostly inherits from GPT4Tokenizer, but uses RegexTokenizerTorch's encode method
  • train_torch.py
    • Similar to train.py but trains BasicTokenizerTorch

The following files are modified:

  • minbpe/__init__.py
    • Import torch tokenizers
  • tests/test_tokenizer.py
    • Add torch tokenizers to tests

It takes 67.4 seconds on an H100 80GB SXM5 to train the BasicTokenizerTorch with a vocab_size of 512 on 308MB of Enron emails. The original code takes 2hrs 15min on an M2 Air with Python 3.11 to do this.

I'm not sure if RegexTokenizerTorch or GPT4TokenizerTorch can benefit much from PyTorch, since there are many chunks of varying lengths, i.e. a "ragged tensor". These tokenizers are helpful for sanity checks, though. For example, the test_gpt4_tiktoken_equality tests all pass, suggesting that merge_torch is correctly implemented.
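For reference, a merge along these lines can be vectorized in a handful of tensor ops. The sketch below is only an illustration of the idea (the name merge_torch comes from the file list above; the body is an assumption, not necessarily the PR's implementation):

```python
import torch

def merge_torch(ids: torch.Tensor, pair: tuple[int, int], new_id: int) -> torch.Tensor:
    # Replace every greedy left-to-right occurrence of `pair` in a 1-D tensor of
    # token ids with `new_id`, mirroring the pure-Python merge() in minbpe.
    # Illustrative sketch only; not necessarily the PR's actual merge_torch.
    if ids.numel() < 2:
        return ids
    matches = (ids[:-1] == pair[0]) & (ids[1:] == pair[1])
    # When pair[0] == pair[1], matches inside a run like (a, a, a) overlap;
    # greedy scanning keeps only every other match within each run.
    idx = torch.arange(matches.numel(), device=ids.device)
    prev = torch.cat([matches.new_zeros(1), matches[:-1]])
    run_start = torch.where(matches & ~prev, idx, torch.zeros_like(idx))
    run_start = torch.cummax(run_start, dim=0).values  # start index of the current run
    keep = matches & ((idx - run_start) % 2 == 0)
    # Overwrite the first token of each kept match with the new id...
    out = ids.clone()
    out[:-1][keep] = new_id
    # ...and drop the second token of each kept match.
    drop = torch.zeros_like(ids, dtype=torch.bool)
    drop[1:][keep] = True
    return out[~drop]
```

Because each merge is a single vectorized pass over one tensor that stays on the GPU, the per-merge cost is roughly independent of how many occurrences get replaced, which is presumably where most of the speedup over the Python dict-and-loop version comes from.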

I also made a new repository, minbpe-pytorch, in case adding PyTorch support is beyond the scope of this project.

@kuprel changed the title from "Train BasicTokenizer on GPU with PyTorch" to "Train BasicTokenizer on GPU with PyTorch, 55x speedup" on Feb 22, 2024
@kuprel changed the title from "Train BasicTokenizer on GPU with PyTorch, 55x speedup" to "Train BasicTokenizer on GPU with PyTorch, 100x speedup" on Feb 23, 2024
@kuprel (Author) commented Feb 23, 2024

Using an H100 and int16, it's now a 108x speedup over the original implementation on the M2 Air.

@kuprel (Author) commented Feb 25, 2024

All of the tests pass

[Screenshot of the passing test run, 2024-02-24]

@karpathy (Owner) commented
Ok I'll step through this soon to take a look.
Not sure that I love duplicating everything and creating torch versions of it.
Would we be able to potentially isolate the def that is the bottleneck (I'm guessing in base.py), and just surgically have a fast version of one of those defs?
If that isn't straightforward, happy to link to minbpe-pytorch.

@kuprel (Author) commented Feb 27, 2024

Thanks for the feedback! I made the diff more surgical. Now the only added files are:

  • minbpe/basic_torch.py
    • Contains merge_torch and BasicTorchTokenizer, which overrides the train and encode methods of BasicTokenizer (a rough sketch follows after these lists)
  • train_torch.py
    • Similar to train.py but trains BasicTorchTokenizer

And the following files are lightly modified:

  • minbpe/__init__.py
    • Import BasicTorchTokenizer
  • tests/test_tokenizer.py
    • Add BasicTorchTokenizer to tests
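For readers skimming this thread, here is a rough sketch of what the surgical override could look like, using the file and class names listed above. The method body is an assumption for illustration only, reusing a vectorized merge_torch like the one sketched in the earlier comment; it is not the PR's exact code:

```python
# minbpe/basic_torch.py (illustrative sketch, not the PR's exact code)
import torch
from .basic import BasicTokenizer
# assumes merge_torch (vectorized merge, as sketched earlier) is defined in this module

class BasicTorchTokenizer(BasicTokenizer):
    """BasicTokenizer whose training inner loop runs on a GPU tensor."""

    def train(self, text, vocab_size, verbose=False):
        assert vocab_size >= 256
        num_merges = vocab_size - 256
        device = "cuda" if torch.cuda.is_available() else "cpu"
        ids = torch.tensor(list(text.encode("utf-8")), dtype=torch.int32, device=device)

        merges = {}                                   # (int, int) -> int
        vocab = {i: bytes([i]) for i in range(256)}   # int -> bytes
        for i in range(num_merges):
            # count adjacent pairs by packing each pair into a single int64 code
            codes = ids[:-1].to(torch.int64) * vocab_size + ids[1:].to(torch.int64)
            uniq, counts = torch.unique(codes, return_counts=True)
            top = int(uniq[torch.argmax(counts)])
            pair = (top // vocab_size, top % vocab_size)
            new_id = 256 + i
            ids = merge_torch(ids, pair, new_id)      # vectorized merge on the GPU
            merges[pair] = new_id
            vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
            if verbose:
                print(f"merge {i + 1}/{num_merges}: {pair} -> {new_id}")

        self.merges = merges
        self.vocab = vocab
```

train_torch.py would then presumably mirror train.py, just constructing BasicTorchTokenizer instead of BasicTokenizer.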
