ld_qwen2.c

This project implements large language model inference and fine-tuning with zero dependencies: it relies on no framework and builds LLM training and inference from scratch. The first supported model is Tongyi Qianwen (Qwen), hence the name ld_qwen2.c.

Directory layout

single_deploy: single-file deployment

src: model deployment sources

tools: utilities, including the model export script, the Python run wrapper, demos, etc.

scripts: shell scripts

Installation

git clone https://github.com/l1351868270/ld_qwen2.c.git

cd ld_qwen2.c

python tools/export.py --filepath="qwen1.5-0.5B-fp16.bin" --dtype="fp16" --model_type=Qwen/Qwen1.5-0.5B-Chat

make clean

make single_W16A16   # or: make qwen2

python tools/run.py -p "天空为什么是蓝色的" -m "Qwen/Qwen1.5-0.5B-Chat" -q fp16 --batch 1   # the prompt asks "Why is the sky blue?"

or, equivalently:

./scripts/run_qwen2_0.5B.sh

Completed

CPU

x86 AVX-512

aarch64 NEON
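
The CPU kernels target AVX-512 on x86 and NEON on aarch64. As a minimal sketch of the SIMD style involved (illustrative, not the repository's actual kernel), an AVX-512 dot product with fused multiply-add might look like:

```c
#include <immintrin.h>
#include <stddef.h>

// Minimal AVX-512 dot-product sketch (illustrative, not the repo's kernel).
// Assumes n is a multiple of 16; compile with -mavx512f.
static float dot_avx512(const float *a, const float *b, size_t n) {
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        acc = _mm512_fmadd_ps(va, vb, acc);   // acc += va * vb, 16 lanes at once
    }
    return _mm512_reduce_add_ps(acc);          // horizontal sum of the 16 lanes
}
```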

KV cache
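
The KV cache stores the key/value projections of every past position, so each decode step only computes projections for the new token. A minimal sketch of one possible layout (field names are illustrative, not the repository's actual structures):

```c
#include <string.h>
#include <stddef.h>

// Hypothetical KV-cache layout: keys and values indexed as [layer][position][kv_dim].
typedef struct {
    int n_layers, max_seq_len, kv_dim;
    float *k;   // n_layers * max_seq_len * kv_dim entries
    float *v;   // same shape as k
} KVCache;

// Append the new token's key/value vectors for one layer at position pos.
static void kv_cache_store(KVCache *c, int layer, int pos,
                           const float *k_t, const float *v_t) {
    size_t off = ((size_t)layer * c->max_seq_len + pos) * c->kv_dim;
    memcpy(c->k + off, k_t, (size_t)c->kv_dim * sizeof(float));
    memcpy(c->v + off, v_t, (size_t)c->kv_dim * sizeof(float));
}
```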

Mixed precision

float accumulation

half-precision (fp16) a and b operands
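
Mixed precision here means the matmul operands a and b stay in fp16 while partial sums accumulate in fp32, which limits rounding error over long dot products. A scalar sketch (using the _Float16 extension available in recent GCC/Clang on supported targets):

```c
// Mixed-precision dot product sketch: half-precision inputs, float accumulator.
// _Float16 is a C23 type / GCC and Clang extension; availability is target-dependent.
static float dot_f16_accf32(const _Float16 *a, const _Float16 *b, int n) {
    float acc = 0.0f;                        // accumulate in fp32, not fp16
    for (int i = 0; i < n; i++)
        acc += (float)a[i] * (float)b[i];
    return acc;
}
```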

Operator fusion
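
Operator fusion merges adjacent operations into a single kernel so intermediate tensors never round-trip through memory. As one hypothetical example (not necessarily the fusion this repository implements), a residual add fused into RMSNorm, the normalization Qwen2 uses:

```c
#include <math.h>

// Fused residual-add + RMSNorm sketch: the add happens in the same kernel that
// computes the sum of squares, so the summed activation is never written out
// as a separate intermediate tensor.
static void fused_add_rmsnorm(float *out, float *x, const float *residual,
                              const float *weight, int n, float eps) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) {
        x[i] += residual[i];                 // residual add, in place
        ss += x[i] * x[i];
    }
    float inv_rms = 1.0f / sqrtf(ss / n + eps);
    for (int i = 0; i < n; i++)
        out[i] = x[i] * inv_rms * weight[i]; // RMSNorm with learned scale
}
```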

Group attention

flash attention

GQA (grouped-query attention)

MHA (multi-head attention)
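
The core of a FlashAttention-style kernel is the online softmax: one pass over the cached keys/values with a running maximum and running sum, so no sequence-length score buffer is ever materialized. A single-query scalar sketch (illustrative; under GQA the same loop runs with several query heads sharing one K/V head):

```c
#include <math.h>

// Single-query attention with online softmax. m is the running max, l the
// running normalizer; previous accumulators are rescaled whenever m grows.
static void attend_online_softmax(float *out, const float *q,
                                  const float *K, const float *V,
                                  int T, int d) {
    float m = -INFINITY, l = 0.0f;
    for (int i = 0; i < d; i++) out[i] = 0.0f;
    for (int t = 0; t < T; t++) {
        float s = 0.0f;                           // score = q . K[t] / sqrt(d)
        for (int i = 0; i < d; i++) s += q[i] * K[t * d + i];
        s /= sqrtf((float)d);
        float m_new = s > m ? s : m;
        float corr = expf(m - m_new);             // rescale factor for old terms
        float p = expf(s - m_new);
        l = l * corr + p;
        for (int i = 0; i < d; i++)
            out[i] = out[i] * corr + p * V[t * d + i];
        m = m_new;
    }
    for (int i = 0; i < d; i++) out[i] /= l;      // final softmax normalization
}
```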

Quantization

vector-wise quantization

W8A16

W4A16
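
W8A16 keeps weights in int8 and activations in 16-bit floats; vector-wise quantization gives each weight row (output channel) its own scale, recovered with one multiply per output. A sketch of the corresponding matrix-vector product (layout and names are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

// W8A16 matvec sketch with vector-wise (per-output-row) scales: int8 weights,
// half-precision activations, fp32 accumulation, one dequantize per row.
static void matvec_w8a16(float *out, const int8_t *W, const float *scale,
                         const _Float16 *x, int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        const int8_t *w = W + (size_t)r * cols;
        for (int c = 0; c < cols; c++)
            acc += (float)w[c] * (float)x[c];
        out[r] = acc * scale[r];                 // apply the row's scale once
    }
}
```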

Batching

naive batching / static batching

In progress

Model parallelism: tensor parallelism (see the sketch below)
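
Tensor parallelism shards a single layer's weight matrix across workers, each computing its own slice of the output (the Megatron-LM column/row splits follow this idea). A thread-based sketch of the partitioning (illustrative; a real setup shards across devices and adds an all-gather or all-reduce):

```c
#include <pthread.h>
#include <stddef.h>

// Tensor-parallel matvec sketch: the weight matrix is split by output rows and
// each worker fills its own slice of out. Compile with -pthread.
typedef struct {
    const float *W, *x;
    float *out;
    int row_begin, row_end, cols;
} Shard;

static void *matvec_shard(void *arg) {
    Shard *s = (Shard *)arg;
    for (int r = s->row_begin; r < s->row_end; r++) {
        float acc = 0.0f;
        for (int c = 0; c < s->cols; c++)
            acc += s->W[(size_t)r * s->cols + c] * s->x[c];
        s->out[r] = acc;
    }
    return NULL;
}

static void matvec_tensor_parallel(float *out, const float *W, const float *x,
                                   int rows, int cols, int nworkers) {
    pthread_t th[8];
    Shard sh[8];                                  // assumes nworkers <= 8
    int chunk = (rows + nworkers - 1) / nworkers;
    for (int i = 0; i < nworkers; i++) {
        int b = i * chunk;
        int e = b + chunk > rows ? rows : b + chunk;
        sh[i] = (Shard){W, x, out, b, e, cols};
        pthread_create(&th[i], NULL, matvec_shard, &sh[i]);
    }
    for (int i = 0; i < nworkers; i++)
        pthread_join(th[i], NULL);
}
```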

TODO

KV cache

paged attention

Quantization

activation quantization

Batching

continuous batching / in-flight batching

Training

single-node, single-GPU

single-node, multi-GPU

multi-node, multi-GPU

References

Qwen2

qwen2 github code

qwen2 transformers code

qwen2 blog

Quantization

llm.int8

SmoothQuant

AWQ

High-performance computing

cs267

Is Parallel Programming Hard, And, If So, What Can You Do About It?

CUDA

PTX

Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors

Benchmarking and Dissecting the Nvidia Hopper GPU Architecture

GPUs Go Brrr

Model parallelism

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

continuous batching

How continuous batching enables 23x throughput in LLM inference while reducing p50 latency

flash-attention

From Online Softmax to FlashAttention

Flash-Decoding for long-context inference
