ld_qwen2.c

This project implements large language model inference and fine-tuning with zero dependencies: it relies on no framework and builds LLM training and inference from scratch. The first supported model is Tongyi Qianwen (Qwen), hence the name ld_qwen2.c.

Directory layout

single_deploy: single-file deployment

src: model deployment sources

tools: utilities, including the model export script, the Python run wrapper, demos, etc.

scripts: shell scripts

Installation

git clone https://github.com/l1351868270/ld_qwen2.c.git

cd ld_qwen2.c

python tools/export.py --filepath="qwen1.5-0.5B-fp16.bin" --dtype="fp16" --model_type=Qwen/Qwen1.5-0.5B-Chat

make clean

make single_W16A16   # or: make qwen2

python tools/run.py -p "天空为什么是蓝色的" -m "Qwen/Qwen1.5-0.5B-Chat" -q fp16 --batch 1   # the prompt asks "Why is the sky blue?"

or, equivalently:

./scripts/run_qwen2_0.5B.sh

Completed

CPU

x86 AVX-512

aarch64 NEON
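
The CPU kernels target AVX-512 on x86 and NEON on aarch64. As a minimal sketch of the SIMD style involved (illustrative, not the repository's actual kernel), an AVX-512 dot product with fused multiply-add might look like:

```c
#include <immintrin.h>
#include <stddef.h>

// Minimal AVX-512 dot-product sketch (illustrative, not the repo's kernel).
// Assumes n is a multiple of 16; compile with -mavx512f.
static float dot_avx512(const float *a, const float *b, size_t n) {
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        acc = _mm512_fmadd_ps(va, vb, acc);   // acc += va * vb, 16 lanes at once
    }
    return _mm512_reduce_add_ps(acc);          // horizontal sum of the 16 lanes
}
```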

KV cache
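
The KV cache stores the key/value projections of every past position, so each decode step only computes projections for the new token. A minimal sketch of one possible layout (field names are illustrative, not the repository's actual structures):

```c
#include <string.h>
#include <stddef.h>

// Hypothetical KV-cache layout: keys and values indexed as [layer][position][kv_dim].
typedef struct {
    int n_layers, max_seq_len, kv_dim;
    float *k;   // n_layers * max_seq_len * kv_dim entries
    float *v;   // same shape as k
} KVCache;

// Append the new token's key/value vectors for one layer at position pos.
static void kv_cache_store(KVCache *c, int layer, int pos,
                           const float *k_t, const float *v_t) {
    size_t off = ((size_t)layer * c->max_seq_len + pos) * c->kv_dim;
    memcpy(c->k + off, k_t, (size_t)c->kv_dim * sizeof(float));
    memcpy(c->v + off, v_t, (size_t)c->kv_dim * sizeof(float));
}
```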

Mixed precision

float accumulation

half-precision (fp16) a and b operands
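
Mixed precision here means the matmul operands a and b stay in fp16 while partial sums accumulate in fp32, which limits rounding error over long dot products. A scalar sketch (using the _Float16 extension available in recent GCC/Clang on supported targets):

```c
// Mixed-precision dot product sketch: half-precision inputs, float accumulator.
// _Float16 is a C23 type / GCC and Clang extension; availability is target-dependent.
static float dot_f16_accf32(const _Float16 *a, const _Float16 *b, int n) {
    float acc = 0.0f;                        // accumulate in fp32, not fp16
    for (int i = 0; i < n; i++)
        acc += (float)a[i] * (float)b[i];
    return acc;
}
```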

Operator fusion
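
Operator fusion merges adjacent operations into a single kernel so intermediate tensors never round-trip through memory. As one hypothetical example (not necessarily the fusion this repository implements), a residual add fused into RMSNorm, the normalization Qwen2 uses:

```c
#include <math.h>

// Fused residual-add + RMSNorm sketch: the add happens in the same kernel that
// computes the sum of squares, so the summed activation is never written out
// as a separate intermediate tensor.
static void fused_add_rmsnorm(float *out, float *x, const float *residual,
                              const float *weight, int n, float eps) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) {
        x[i] += residual[i];                 // residual add, in place
        ss += x[i] * x[i];
    }
    float inv_rms = 1.0f / sqrtf(ss / n + eps);
    for (int i = 0; i < n; i++)
        out[i] = x[i] * inv_rms * weight[i]; // RMSNorm with learned scale
}
```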

Group attention

flash attention

GQA (grouped-query attention)

MHA (multi-head attention)
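
The core of a FlashAttention-style kernel is the online softmax: one pass over the cached keys/values with a running maximum and running sum, so no sequence-length score buffer is ever materialized. A single-query scalar sketch (illustrative; under GQA the same loop runs with several query heads sharing one K/V head):

```c
#include <math.h>

// Single-query attention with online softmax. m is the running max, l the
// running normalizer; previous accumulators are rescaled whenever m grows.
static void attend_online_softmax(float *out, const float *q,
                                  const float *K, const float *V,
                                  int T, int d) {
    float m = -INFINITY, l = 0.0f;
    for (int i = 0; i < d; i++) out[i] = 0.0f;
    for (int t = 0; t < T; t++) {
        float s = 0.0f;                           // score = q . K[t] / sqrt(d)
        for (int i = 0; i < d; i++) s += q[i] * K[t * d + i];
        s /= sqrtf((float)d);
        float m_new = s > m ? s : m;
        float corr = expf(m - m_new);             // rescale factor for old terms
        float p = expf(s - m_new);
        l = l * corr + p;
        for (int i = 0; i < d; i++)
            out[i] = out[i] * corr + p * V[t * d + i];
        m = m_new;
    }
    for (int i = 0; i < d; i++) out[i] /= l;      // final softmax normalization
}
```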

Quantization

vector-wise quantization

W8A16

W4A16
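
W8A16 keeps weights in int8 and activations in 16-bit floats; vector-wise quantization gives each weight row (output channel) its own scale, recovered with one multiply per output. A sketch of the corresponding matrix-vector product (layout and names are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

// W8A16 matvec sketch with vector-wise (per-output-row) scales: int8 weights,
// half-precision activations, fp32 accumulation, one dequantize per row.
static void matvec_w8a16(float *out, const int8_t *W, const float *scale,
                         const _Float16 *x, int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        const int8_t *w = W + (size_t)r * cols;
        for (int c = 0; c < cols; c++)
            acc += (float)w[c] * (float)x[c];
        out[r] = acc * scale[r];                 // apply the row's scale once
    }
}
```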

Batching

naive batching / static batching

In progress

Model parallelism: tensor parallelism (see the sketch below)
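
Tensor parallelism shards a single layer's weight matrix across workers, each computing its own slice of the output (the Megatron-LM column/row splits follow this idea). A thread-based sketch of the partitioning (illustrative; a real setup shards across devices and adds an all-gather or all-reduce):

```c
#include <pthread.h>
#include <stddef.h>

// Tensor-parallel matvec sketch: the weight matrix is split by output rows and
// each worker fills its own slice of out. Compile with -pthread.
typedef struct {
    const float *W, *x;
    float *out;
    int row_begin, row_end, cols;
} Shard;

static void *matvec_shard(void *arg) {
    Shard *s = (Shard *)arg;
    for (int r = s->row_begin; r < s->row_end; r++) {
        float acc = 0.0f;
        for (int c = 0; c < s->cols; c++)
            acc += s->W[(size_t)r * s->cols + c] * s->x[c];
        s->out[r] = acc;
    }
    return NULL;
}

static void matvec_tensor_parallel(float *out, const float *W, const float *x,
                                   int rows, int cols, int nworkers) {
    pthread_t th[8];
    Shard sh[8];                                  // assumes nworkers <= 8
    int chunk = (rows + nworkers - 1) / nworkers;
    for (int i = 0; i < nworkers; i++) {
        int b = i * chunk;
        int e = b + chunk > rows ? rows : b + chunk;
        sh[i] = (Shard){W, x, out, b, e, cols};
        pthread_create(&th[i], NULL, matvec_shard, &sh[i]);
    }
    for (int i = 0; i < nworkers; i++)
        pthread_join(th[i], NULL);
}
```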

TODO

KV cache

paged attention

Quantization

activation quantization

Batching

continuous batching / in-flight batching

Training

single-node, single-GPU

single-node, multi-GPU

multi-node, multi-GPU

References

Qwen2

qwen2 github code

qwen2 transformers code

qwen2 blog

Quantization

llm.int8

SmoothQuant

AWQ

High-performance computing

cs267

Is Parallel Programming Hard, And, If So, What Can You Do About It?

CUDA

PTX

Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors

Benchmarking and Dissecting the Nvidia Hopper GPU Architecture

GPUs Go Brrr

Model parallelism

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

continuous batching

How continuous batching enables 23x throughput in LLM inference while reducing p50 latency

flash-attention

From Online Softmax to FlashAttention

Flash-Decoding for long-context inference
