Problem with "Tutorial: Getting Started using the LLAMA driver" #218
I'm hitting the same issue.
I get the same output using the alpaca model as well. Not sure if this helps, but I can get this working with the Llama-2-7B-Chat-ggml model by following the steps below.

Download the model:

$ wget -P models https://huggingface.co/localmodels/Llama-2-7B-Chat-ggml/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin

Create a Python virtual environment to run the conversion tool:

$ python3 -m venv llamav
$ source llamav/bin/activate
(llamav) $ pip install -r requirements.txt
Collecting numpy==1.24.4
Using cached numpy-1.24.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting sentencepiece==0.1.98
Using cached sentencepiece-0.1.98-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting gguf>=0.1.0
Using cached gguf-0.4.4-py3-none-any.whl (10 kB)
Installing collected packages: sentencepiece, numpy, gguf
Successfully installed gguf-0.4.4 numpy-1.24.4 sentencepiece-0.1.98
[notice] A new release of pip available: 22.2.2 -> 23.3
[notice] To update, run: pip install --upgrade pip

Convert the model:

(llamav) $ ./convert-llama-ggml-to-gguf.py --input models/llama-2-7b-chat.ggmlv3.q4_0.bin --output models/llama-2-7b-chat.gguf.q4_0.bin
* Using config: Namespace(input=PosixPath('models/llama-2-7b-chat.ggmlv3.q4_0.bin'), output=PosixPath('models/llama-2-7b-chat.gguf.q4_0.bin'), name=None, desc=None, gqa=1, eps='5.0e-06', context_length=2048, model_metadata_dir=None, vocab_dir=None, vocabtype='spm')
=== WARNING === Be aware that this conversion script is best-effort. Use a native GGUF model if possible. === WARNING ===
- Note: If converting LLaMA2, specifying "--eps 1e-5" is required. 70B models also need "--gqa 8".
* Scanning GGML input file
* File format: GGJTv3 with ftype MOSTLY_Q4_0
* GGML model hyperparameters: <Hyperparameters: n_vocab=32000, n_embd=4096, n_mult=256, n_head=32, n_layer=32, n_rot=128, n_ff=11008, ftype=MOSTLY_Q4_0>
=== WARNING === Special tokens may not be converted correctly. Use --model-metadata-dir if possible === WARNING ===
* Preparing to save GGUF file
* Adding model parameters and KV items
* Adding 32000 vocab item(s)
* Adding 291 tensor(s)
gguf: write header
gguf: write metadata
gguf: write tensors
* Successful completion. Output saved to: models/llama-2-7b-chat.gguf.q4_0.bin

Then run using:

$ ./main -m models/llama-2-7b-chat.gguf.q4_0.bin --prompt "I love Rust because"
Log start
main: build = 1399 (004797f)
main: built with cc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1) for x86_64-redhat-linux
main: seed = 1697721286
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/llama-2-7b-chat.gguf.q4_0.bin (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 1: output_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 2: output.weight q4_0 [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 10: blk.0.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 11: blk.0.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 13: blk.1.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 14: blk.1.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 15: blk.1.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 16: blk.1.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 17: blk.1.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 18: blk.1.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 19: blk.1.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 20: blk.1.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 21: blk.2.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 22: blk.2.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 23: blk.2.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 24: blk.2.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 25: blk.2.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 26: blk.2.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 27: blk.2.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 28: blk.2.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 29: blk.2.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 30: blk.3.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 31: blk.3.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 32: blk.3.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 33: blk.3.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 34: blk.3.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 35: blk.3.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 36: blk.3.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 37: blk.3.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 38: blk.3.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 39: blk.4.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 40: blk.4.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 41: blk.4.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 42: blk.4.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 43: blk.4.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 44: blk.4.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 45: blk.4.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 46: blk.4.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 47: blk.4.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 48: blk.5.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 49: blk.5.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 50: blk.5.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 51: blk.5.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 52: blk.5.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 53: blk.5.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 54: blk.5.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 55: blk.5.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 56: blk.5.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 57: blk.6.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 58: blk.6.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 59: blk.6.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 60: blk.6.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 61: blk.6.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 62: blk.6.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 63: blk.6.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 64: blk.6.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 65: blk.6.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 66: blk.7.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 67: blk.7.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 68: blk.7.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 69: blk.7.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 70: blk.7.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 71: blk.7.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 72: blk.7.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 73: blk.7.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 74: blk.7.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 75: blk.8.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 76: blk.8.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 77: blk.8.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 78: blk.8.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 79: blk.8.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 80: blk.8.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 81: blk.8.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 82: blk.8.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 83: blk.8.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 84: blk.9.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 85: blk.9.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 86: blk.9.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 87: blk.9.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 88: blk.9.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 89: blk.9.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 90: blk.9.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 91: blk.9.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 92: blk.9.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 93: blk.10.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 94: blk.10.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 95: blk.10.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 96: blk.10.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 97: blk.10.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 98: blk.10.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 99: blk.10.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 100: blk.10.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 101: blk.10.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 102: blk.11.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 103: blk.11.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 104: blk.11.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 105: blk.11.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 106: blk.11.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 107: blk.11.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 108: blk.11.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 109: blk.11.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 110: blk.11.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 111: blk.12.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 112: blk.12.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 113: blk.12.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 114: blk.12.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 115: blk.12.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 116: blk.12.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 117: blk.12.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 118: blk.12.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 119: blk.12.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 120: blk.13.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 121: blk.13.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 122: blk.13.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 123: blk.13.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 124: blk.13.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 125: blk.13.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 126: blk.13.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 127: blk.13.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 128: blk.13.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 129: blk.14.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 130: blk.14.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 131: blk.14.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 132: blk.14.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 133: blk.14.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 134: blk.14.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 135: blk.14.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 136: blk.14.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 137: blk.14.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 138: blk.15.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 139: blk.15.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 140: blk.15.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 141: blk.15.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 142: blk.15.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 143: blk.15.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 144: blk.15.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 145: blk.15.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 146: blk.15.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 147: blk.16.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 148: blk.16.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 149: blk.16.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 150: blk.16.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 151: blk.16.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 152: blk.16.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 153: blk.16.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 154: blk.16.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 155: blk.16.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 156: blk.17.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 157: blk.17.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 158: blk.17.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 159: blk.17.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 160: blk.17.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 161: blk.17.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 162: blk.17.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 163: blk.17.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 164: blk.17.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 165: blk.18.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 166: blk.18.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 167: blk.18.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 168: blk.18.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 169: blk.18.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 170: blk.18.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 171: blk.18.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 172: blk.18.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 173: blk.18.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 174: blk.19.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 175: blk.19.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 176: blk.19.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 177: blk.19.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 178: blk.19.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 179: blk.19.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 180: blk.19.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 181: blk.19.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 182: blk.19.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 183: blk.20.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 184: blk.20.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 185: blk.20.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 186: blk.20.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 187: blk.20.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 188: blk.20.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 189: blk.20.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 190: blk.20.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 191: blk.20.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 192: blk.21.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 193: blk.21.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 194: blk.21.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 195: blk.21.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 196: blk.21.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 197: blk.21.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 198: blk.21.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 199: blk.21.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 200: blk.21.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 201: blk.22.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 202: blk.22.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 203: blk.22.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 204: blk.22.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 205: blk.22.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 206: blk.22.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 207: blk.22.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 208: blk.22.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 209: blk.22.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 210: blk.23.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 211: blk.23.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 212: blk.23.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 213: blk.23.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 214: blk.23.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 215: blk.23.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 216: blk.23.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 217: blk.23.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 218: blk.23.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 219: blk.24.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 220: blk.24.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 221: blk.24.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 222: blk.24.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 223: blk.24.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 224: blk.24.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 225: blk.24.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 226: blk.24.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 227: blk.24.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 228: blk.25.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 229: blk.25.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 230: blk.25.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 231: blk.25.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 232: blk.25.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 233: blk.25.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 234: blk.25.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 235: blk.25.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 236: blk.25.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 237: blk.26.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 238: blk.26.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 239: blk.26.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 240: blk.26.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 241: blk.26.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 242: blk.26.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 243: blk.26.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 244: blk.26.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 245: blk.26.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 246: blk.27.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 247: blk.27.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 248: blk.27.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 249: blk.27.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 250: blk.27.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 251: blk.27.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 252: blk.27.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 253: blk.27.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 254: blk.27.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 255: blk.28.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 256: blk.28.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 257: blk.28.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 258: blk.28.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 259: blk.28.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 260: blk.28.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 261: blk.28.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 262: blk.28.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 263: blk.28.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 264: blk.29.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 265: blk.29.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 266: blk.29.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 267: blk.29.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 268: blk.29.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 269: blk.29.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 270: blk.29.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 271: blk.29.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 272: blk.29.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 273: blk.30.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 274: blk.30.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 275: blk.30.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 276: blk.30.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 277: blk.30.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 278: blk.30.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 279: blk.30.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 280: blk.30.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 281: blk.30.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 282: blk.31.attn_q.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 283: blk.31.attn_k.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 284: blk.31.attn_v.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 285: blk.31.attn_output.weight q4_0 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 286: blk.31.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 287: blk.31.ffn_gate.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 288: blk.31.ffn_down.weight q4_0 [ 11008, 4096, 1, 1 ]
llama_model_loader: - tensor 289: blk.31.ffn_up.weight q4_0 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 290: blk.31.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: general.description str
llama_model_loader: - kv 3: general.file_type u32
llama_model_loader: - kv 4: llama.context_length u32
llama_model_loader: - kv 5: llama.embedding_length u32
llama_model_loader: - kv 6: llama.block_count u32
llama_model_loader: - kv 7: llama.feed_forward_length u32
llama_model_loader: - kv 8: llama.rope.dimension_count u32
llama_model_loader: - kv 9: llama.attention.head_count u32
llama_model_loader: - kv 10: llama.attention.head_count_kv u32
llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 226 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 5.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.53 GiB (4.50 BPW)
llm_load_print_meta: general.name = llama-2-7b-chat.ggmlv3.q4_0.bin
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MB
llm_load_tensors: mem required = 3615.74 MB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 76.63 MB
system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
I love Rust because it's a programming language that values safety and performance.ϊ It's designed to prevent common programming errors like null pointer dereferences, data races, and buffer overflows, which can lead to crashes or security vulnerabilities. This makes it an excellent choice for building systems that require high reliability, such as operating systems, file systems, and web browsers.
Rust also has a strong focus on performance, with a goal of competing with C++ in terms of execution speed. It achieves this through a combination of language features like zero-cost abstractions, compile-time evaluation of loops, and smart borrowing, which allow for more efficient code generation. Additionally, Rust's ownership system helps prevent data races and other performance-related issues, making it an ideal choice for high-performance applications.
Another advantage of Rust is its growing ecosystem of libraries and frameworks, which provide a wide range of tools and resources for building different types of applications. For example, the `rocket` framework makes it easy to build web applications with Rust, while the `tokio` library provides a lightweight, non-blocking I/O library for building concurrent systems. This means that Rust can be used for a wide range of applications beyond just low-level system programming, and its ecosystem is constantly evolving to support new use cases.
However, it's important to note that Rust may not be the best choice for every project. Some languages like Python or JavaScript are more suitable for certain types of projects, such as data science or web development, due to their ease of use and extensive libraries for those domains. Additionally, Rust's steep learning curve can make it difficult for beginners to pick up, especially compared to more established programming languages like Java or C#.
In conclusion, Rust is an excellent choice for building systems that require safety, performance, and concurrency. Its focus on safety and performance makes it ideal for building high-reliability systems, while its growing ecosystem of libraries and frameworks means it can be used for a wide range of applications beyond just low-level system programming. However, Rust may not be the best choice for every project due to its steep learning curve and limited adoption in certain domains. [end of text]
llama_print_timings: load time = 2010.55 ms
llama_print_timings: sample time = 2944.60 ms / 478 runs ( 6.16 ms per token, 162.33 tokens per second)
llama_print_timings: prompt eval time = 477.93 ms / 6 tokens ( 79.66 ms per token, 12.55 tokens per second)
llama_print_timings: eval time = 105234.34 ms / 477 runs ( 220.62 ms per token, 4.53 tokens per second)
llama_print_timings: total time = 109146.41 ms
Log end
I was just wandering around, so for my case this was a solution; can't say the same for OP. Apparently the tutorial pages at "https://docs.llm-chain.xyz/" are deprecated. Also, the code shared for the LLAMA driver (under the "Tutorial: Getting Started using the LLAMA driver" page, step 6) expects a GGML model. If I can download a GGML model, what is the point of converting it to GGUF? And if GGUF is better in some ways, how can I use it in these tutorials? By the way, here is how I'm trying to use the model in "main.rs":
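Roughly along these lines (a minimal sketch, assuming llm-chain's executor!/options!/prompt! macros behave as in the project's examples; the model path is illustrative):

use llm_chain::options::ModelRef;
use llm_chain::{executor, options, parameters, prompt};

#[tokio::main(flavor = "current_thread")]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Point the llama.cpp driver at a local model file (illustrative path).
    let opts = options!(
        Model: ModelRef::from_path("./models/llama-2-7b-chat.gguf.q4_0.bin")
    );
    let exec = executor!(llama, opts)?;
    // Run a single completion through the executor.
    let res = prompt!("I love Rust because")
        .run(&parameters!(), &exec)
        .await?;
    println!("{}", res);
    Ok(())
}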
And here is the resulting error:
Looking at the bindgen output, it seems like there's support for both GGML and GGUF in 0.13; just Ctrl+F for either and you'll see an abundance of matches.
Alright luvlies, seems like it's fixed on main. If you want to use this you'll need to clone the repo including the submodules (we need llama.cpp to make the bindings from scratch):
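For example (the repository URL here is assumed from the dependency paths below):

$ git clone --recursive https://github.com/sobelio/llm-chain.git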
and then your Cargo.toml:

[package]
name = "my_package"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
llm-chain = { path = "../llm-chain/crates/llm-chain" }
llm-chain-llama = { path = "../llm-chain/crates/llm-chain-llama" }
tokio = { version = "1.37.0", features = ["full"] }

I was using a version from before the fix, which was delivered here: cad5646 (Dec 2023)
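For what it's worth, Cargo can also pull the crates straight from git instead of local paths (it fetches submodules of git dependencies too); a sketch, assuming the same upstream repository:

[dependencies]
llm-chain = { git = "https://github.com/sobelio/llm-chain" }
llm-chain-llama = { git = "https://github.com/sobelio/llm-chain" }
tokio = { version = "1.37.0", features = ["full"] }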
The Problem
The tutorial linked in this project's readme is either incorrect or out of date, or following it results in a bad build.
Description
I followed the tutorial to get started using the llama driver, and all went well until it asked me to run this command in "Step 4":
python convert.py ./models/alpaca-native
The instructions say that running this in the project root directory should output a ggml-model-f32.bin file in ./models/alpaca-native; however, it outputs a file with a .gguf extension instead: ggml-model-f32.gguf.

The next instruction asks me to run this command:
./main -m models/alpaca-native/ggml-model-f32.bin -n 128 -p "I love Rust because"
Because there is no ggml-model-f32.bin, this command will not run.

Failed Solutions
I tried to replace the .bin extension with .gguf and run the tutorial's command as this instead:

./main -m models/alpaca-native/ggml-model-f32.gguf -n 128 -p "I love Rust because"
The program runs, and after a minute or two it produces this output while continuing to run:
I love Rust because ÂÄÄÄÄ
While I will yell "AAAA! I love Rust" from time to time, this isn't really what I was hoping for from the program.
What can I do to complete the tutorial successfully?