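This project supports large language model inference and fine-tuning with zero dependencies. It does not rely on any framework and implements LLM training and inference from scratch. Because the first model it supported was Tongyi Qianwen (Qwen), it is named ld_qwen2.c.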
single_deploy: single-file deployment
src: model deployment files
tools: utility files, including model export, the Python run wrapper, demos, etc.
scripts: script files
git clone https://github.com/l1351868270/ld_qwen2.c.git
cd ld_qwen2.c
python tools/export.py --filepath="qwen1.5-0.5B-fp16.bin" --dtype="fp16" --model_type=Qwen/Qwen1.5-0.5B-Chat
make clean
make single_W16A16 or make qwen2
python tools/run.py -p "天空为什么是蓝色的" -m "Qwen/Qwen1.5-0.5B-Chat" -q fp16 --batch 1
or
./scripts/run_qwen2_0.5B.sh
x86 avx512
aarch64 neon
accumulate float
a,b half
group attention
flash attention
GHA
MHA
W8A16
W4A16
naive batch / static batch
Model parallelism: tensor parallelism
paged attention
Activation quantization
continuous batch / in-flight batch
Single node, single GPU
Single node, multiple GPUs
Multiple nodes, multiple GPUs
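As an illustration of the W8A16 item in the list above (8-bit weights, 16-bit activations), here is a minimal sketch in plain Python. It assumes symmetric per-row int8 quantization with one scale per weight row; the source does not say which scheme ld_qwen2.c actually uses, and the function names `quantize_row_int8` and `matvec_w8a16` are hypothetical. Python floats stand in for the fp16 activations.

```python
def quantize_row_int8(row):
    """Symmetric per-row quantization (assumed scheme): returns (int8 values, scale)."""
    max_abs = max(abs(x) for x in row)
    # Map the largest magnitude in the row to 127; avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def matvec_w8a16(q_rows, scales, x):
    """Compute y = W @ x with W stored as int8 rows plus one scale per row."""
    y = []
    for q, scale in zip(q_rows, scales):
        acc = 0.0  # accumulate in higher precision than the stored weights
        for w, a in zip(q, x):
            acc += w * a
        y.append(acc * scale)  # dequantize once per output element
    return y


# Tiny example: a 2x3 weight matrix and a length-3 activation vector.
weights = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
x = [1.0, 2.0, 4.0]

packed = [quantize_row_int8(row) for row in weights]
q_rows = [p[0] for p in packed]
scales = [p[1] for p in packed]
y = matvec_w8a16(q_rows, scales, x)

# Reference result computed with the unquantized weights.
y_ref = [sum(w * a for w, a in zip(row, x)) for row in weights]
```

Applying the scale once per output row, after the integer-weighted sum, keeps the inner loop free of dequantization work; comparing `y` with `y_ref` shows the quantization error for this input.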
Is Parallel Programming Hard, And, If So, What Can You Do About It? (Chinese edition: 深入理解并行编程)
Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors
Benchmarking and Dissecting the Nvidia Hopper GPU Architecture
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
How continuous batching enables 23x throughput in LLM inference while reducing p50 latency