This repo is forked from karpathy's llm.c, using C++ (with Eigen) to reproduce GPT-2, with support for both CPU and CUDA.
- All computation is done through the Eigen Tensor module, so the same code can run on CPU or CUDA by simply switching the device (see the sketch after this list).
- Currently, this repo has reproduced GPT-2, and the results are completely aligned with the PyTorch version.
- It is worth noting that the CPU build is about 20% faster than PyTorch, while the GPU build is still far slower than PyTorch on GPU, mainly because Eigen's Tensor module does not support batched matrix multiplication (BatchMatmul).
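For readers unfamiliar with Eigen's Tensor module, here is a minimal sketch (illustration only, not code from this repo) of how one expression can target different devices, and why the missing batched GEMM hurts: a batched product has to be written as a loop of per-batch contractions.

```cpp
// Minimal sketch of Eigen Tensor device switching (illustration only).
#define EIGEN_USE_THREADS
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  Eigen::Tensor<float, 2> a(64, 64), b(64, 64), c(64, 64);
  a.setRandom();
  b.setRandom();

  // Single-threaded CPU: evaluate on the default device.
  Eigen::DefaultDevice cpu;
  c.device(cpu) = a + b * 2.0f;

  // Multi-threaded CPU: the identical expression, just a different device.
  Eigen::ThreadPool pool(8);
  Eigen::ThreadPoolDevice tp(&pool, 8);
  c.device(tp) = a + b * 2.0f;

  // With EIGEN_USE_GPU, an Eigen::GpuDevice wrapping a CUDA stream plays the
  // same role, so one kernel body can serve both the CPU and CUDA builds.

  // The catch: Eigen has no batched GEMM, so a batched matmul becomes a loop
  // of per-batch contractions, which underutilizes the GPU.
  Eigen::Tensor<float, 3> x(4, 64, 64), w(4, 64, 64), y(4, 64, 64);
  x.setRandom();
  w.setRandom();
  Eigen::array<Eigen::IndexPair<int>, 1> dims = {Eigen::IndexPair<int>(1, 0)};
  for (int i = 0; i < 4; ++i) {
    y.chip(i, 0).device(tp) = x.chip(i, 0).contract(w.chip(i, 0), dims);
  }
  return 0;
}
```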
```bash
pip install -r requirements.txt
python dev/data/tinyshakespeare.py
python train_gpt2.py
```
```bash
mkdir build && cd build
cmake ..
make train_gpt2_cpu
cd ../
./build/llmcpp/train_gpt2_cpu
```
The above lines:
- (1) download the tinyshakespeare dataset and tokenize it with the GPT-2 tokenizer,
- (2) download and save the GPT-2 (124M) weights,
- (3) initialize from them in C++ and train for 40 steps on tinyshakespeare with AdamW (using batch size 4, context length only 64; see the AdamW sketch after the sample output), evaluate validation loss, and sample some text. The output looks like this on my LMDE 3 machine (Intel® Core™ i7-10700K CPU @ 3.80GHz × 8):
```
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
padded_vocab_size: 50304
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124475904(474 MB)
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 82723584(315 MB)
val loss 5.325413
step 0: train loss 5.356086 (took 786.515755 ms)
step 1: train loss 4.300581 (took 677.340087 ms)
step 2: train loss 4.623053 (took 674.843167 ms)
step 3: train loss 4.599307 (took 673.189660 ms)
... (truncated) ...
step 39: train loss 3.972404 (took 749.386021 ms)
val loss 4.017484
generating:
---
Requinetarius,
Which; supreme, but
Commands jest in vain for ever.
<|endoftext|>Lady:
No, heavens,
I were not to haste
To retire valorously and look nobly in the face,
Before this
UNHISILIUS UNDERDEINTS
---
step 40: train loss 4.378605 (took 692.830391 ms)
final 40 iters avg: 692.974 ms
```
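For reference, the AdamW update applied at each of the steps above has the following per-parameter form. This is a generic sketch of the AdamW algorithm, not code lifted from this repo:

```cpp
#include <cmath>
#include <cstddef>

// Generic AdamW update (sketch): Adam moments for the gradient, with
// decoupled weight decay applied directly to the parameter.
void adamw_update(float* params, const float* grads, float* m, float* v,
                  std::size_t n, int t /* 1-indexed step */, float lr = 1e-4f,
                  float beta1 = 0.9f, float beta2 = 0.999f, float eps = 1e-8f,
                  float weight_decay = 0.0f) {
  for (std::size_t i = 0; i < n; ++i) {
    m[i] = beta1 * m[i] + (1.0f - beta1) * grads[i];
    v[i] = beta2 * v[i] + (1.0f - beta2) * grads[i] * grads[i];
    // Bias-corrected moment estimates.
    float m_hat = m[i] / (1.0f - std::pow(beta1, t));
    float v_hat = v[i] / (1.0f - std::pow(beta2, t));
    params[i] -= lr * (m_hat / (std::sqrt(v_hat) + eps)
                       + weight_decay * params[i]);
  }
}
```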
To build and run the GPU (CUDA) version instead:

```bash
mkdir build && cd build
cmake ..
make train_gpt2_gpu
cd ../
./build/llmcpp/train_gpt2_gpu
```
The data files inside /dev/data/(dataset).py are responsible for downloading, tokenizing, and saving the tokens to .bin files that are easily readable from C++. For example, when you run:

```bash
python dev/data/tinyshakespeare.py
```

we download and tokenize the tinyshakespeare dataset. The output looks like this:

```
writing 32,768 tokens to ./dev/data/tinyshakespeare/tiny_shakespeare_val.bin
writing 305,260 tokens to ./dev/data/tinyshakespeare/tiny_shakespeare_train.bin
```

The .bin files contain a short header (1024 bytes) followed by a stream of uint16 tokens, i.e. the token ids produced by the GPT-2 tokenizer. More datasets are available in /dev/data.
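As an illustration of that layout, a minimal reader could look like the sketch below. Only the sizes stated above (1024-byte header, uint16 tokens) are assumed; the meaning of the header fields is defined by this repo's dataloader and is not decoded here.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch of reading a token .bin file: skip the 1024-byte header, then read
// the rest of the file as uint16 GPT-2 token ids.
std::vector<std::uint16_t> read_tokens(const char* path) {
  std::FILE* f = std::fopen(path, "rb");
  if (!f) return {};
  std::fseek(f, 0, SEEK_END);
  long size = std::ftell(f);
  std::fseek(f, 1024, SEEK_SET);  // skip the fixed-size header
  std::vector<std::uint16_t> tokens((size - 1024) / sizeof(std::uint16_t));
  std::fread(tokens.data(), sizeof(std::uint16_t), tokens.size(), f);
  std::fclose(f);
  return tokens;
}
```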
I am also attaching a simple unit test to make sure our C++ code agrees with the PyTorch code. On the CPU, for example, compile and run with:
```bash
mkdir build && cd build
cmake ..
make test_gpt2_cpu
cd ../
./build/llmcpp/test_gpt2_cpu
```
This now loads the gpt2_124M_debug_state.bin file that gets written by train_gpt2.py, runs a forward pass, compares the logits and loss with the PyTorch reference implementation, then does 10 iterations of training with Adam and makes sure the losses match PyTorch. This tests both the fp32 path and the mixed precision path. The test should pass and print `overall okay: 1`.
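The heart of such a test is an element-wise tolerance comparison against the saved PyTorch tensors, along the lines of the sketch below; the actual helper names and tolerances used in this repo may differ.

```cpp
#include <cmath>
#include <cstdio>

// Sketch of a tensor check: compare computed values against a PyTorch
// reference within an absolute tolerance (hypothetical helper).
bool check_tensor(const float* actual, const float* expected, int n,
                  const char* label, float tol = 1e-2f) {
  int mismatches = 0;
  for (int i = 0; i < n; ++i) {
    if (std::fabs(actual[i] - expected[i]) > tol) ++mismatches;
  }
  std::printf("%s: %s (%d/%d mismatches)\n", label,
              mismatches == 0 ? "OK" : "NOT OK", mismatches, n);
  return mismatches == 0;
}
```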
MIT