Experimental 3Lin Transformer implementation

This repo contains experimental LLM training and inference code written from scratch in C++ and CUDA, with a concise (~1000 lines) reference CPU implementation. The code uses the experimental Triple-Linear (3Lin) transformer model. The 3Lin transformer lacks ReLU nonlinearities and still achieves competitive results. The CUDA training implementation is optimized for consumer 40-series GPUs. Models can be trained directly in int8 format, consuming 1 byte of GPU memory per parameter during training. To train models at scale, the code supports multiple GPUs per host and distributed training. Distributed training uses regular TCP sockets. To reduce network traffic, gradients are packed to 1 bit per parameter.
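The README does not describe how gradients are packed to 1 bit per parameter. One common way to reach that size is to transmit only the sign of each gradient, and the sketch below assumes that scheme. The names PackSignBits and UnpackSignBits are hypothetical and are not taken from the 3Lin sources. Compared with 32-bit floats, one bit per parameter is a 32x reduction in traffic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the 3Lin sources): keep only the sign of each
// gradient, storing 8 parameters per byte. Bit value 1 means "negative".
std::vector<std::uint8_t> PackSignBits(const std::vector<float> &grad)
{
    std::vector<std::uint8_t> packed((grad.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < grad.size(); ++i) {
        if (grad[i] < 0) {
            packed[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
        }
    }
    return packed;
}

// Hypothetical inverse: rebuild +scale / -scale values from the packed bits.
// Magnitudes are lost, so the receiver has to choose the scale itself.
std::vector<float> UnpackSignBits(const std::vector<std::uint8_t> &packed,
                                  std::size_t count, float scale)
{
    assert(packed.size() * 8 >= count);
    std::vector<float> grad(count);
    for (std::size_t i = 0; i < count; ++i) {
        const bool negative = ((packed[i / 8] >> (i % 8)) & 1u) != 0;
        grad[i] = negative ? -scale : scale;
    }
    return grad;
}
```

Under this assumption a buffer of N gradients becomes (N + 7) / 8 bytes on the wire. How the receiving side picks the scale, and whether any error compensation is applied, is outside the scope of this sketch.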

Build

To build the code, CUDA v12.3 and a C++ compiler are required: MSVC for Windows, cmake+clang for Linux. To support cross-platform build file generation, this repo uses fo, a lightweight solution/build file generator. To generate build files, compile fo/fo.cpp and run it with two arguments: the first is the root of the source tree, the second is the directory to store the build files in.

Windows

D:\3lin>fo.exe code sln

Then open d:\3lin\sln\code.sln.

Linux

To compile 3lin for Linux, compile fo.cpp, generate the CMakeLists.txt file, run cmake, then run make.

~/3lin/fo$ clang++17 fo.cpp -o fo
~/3lin/fo$ cd ..
~/3lin$ ./fo/fo code make.dir
~/3lin$ cd make.dir
~/3lin/make.dir$ cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo .
~/3lin/make.dir$ make

Get train data

The examples in the code use the enwik9 dataset and its truncated version, enwik8. The Hugging Face hosted datasets openwebtext, ontocord/CulturaY, and danasone/librusec are also used in examples. To import them, use hf_import.

Train model

gpt_train is used to train a model. It is controlled by a train script. The default train script is stored in the CONFIG variable in main_gpt.cpp. To load a train script from a file, run gpt_train with the '-c script.txt' argument.

quick run

Compile gpt-train. Run it in the root directory with the test config:

~/3lin$ ./make.dir/gpt-train -c test.cfg

distributed run

Currently, training can be distributed only among a power-of-two number of worker hosts.

To start a worker process, run gpt_train with the '-w 10000' argument, where 10000 is the port number to use.

To run the master process, call the net_train('worker.txt') function in the train script. List the worker IP addresses in the file passed to net_train().
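Because the worker count must be a power of two, it can help to validate the worker list before starting a run. The sketch below is not part of 3Lin. It assumes the worker file holds one IP address per line, which the README does not specify, and the names IsPowerOfTwo and ReadWorkerList are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// True for 1, 2, 4, 8, ...; false for 0 and every other value.
bool IsPowerOfTwo(std::size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// Hypothetical reader: assumes one worker address per line (an assumption,
// the actual file format is not documented here). Surrounding whitespace is
// trimmed and blank lines are skipped.
std::vector<std::string> ReadWorkerList(std::istream &in)
{
    const char *whitespace = " \t\r\n";
    std::vector<std::string> workers;
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t first = line.find_first_not_of(whitespace);
        if (first == std::string::npos) {
            continue; // blank line
        }
        const std::size_t last = line.find_last_not_of(whitespace);
        workers.push_back(line.substr(first, last - first + 1));
    }
    return workers;
}
```

A launcher script could open the worker file with std::ifstream, pass it to ReadWorkerList, and refuse to start unless IsPowerOfTwo(workers.size()) holds.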

multiple GPU

To use multiple GPU devices, set the DEVICE_COUNT variable in the train script to the number of GPUs to use. For distributed runs, DEVICE_COUNT is applied on each worker; heterogeneous configurations are not supported.

Inference test

To try inference with a trained model, use gpt_infer. It runs a basic HTTP server on port 11311 and allows sampling continuations from the model. The current implementation is slow and intended for demonstration purposes.

Tokenizers

Tokenizers are created by gpt_tokenizer.

License

MIT
