[Bug]: Run LLMs with OpenVINO GenAI Flavor on NPU #1216

Open
taikai-zz opened this issue Nov 15, 2024 · 9 comments
Assignees
Labels
bug (Something isn't working), category: LLM (LLM pipeline: stateful, static), category: NPU, PSE, support_request (Support team)

Comments

@taikai-zz

taikai-zz commented Nov 15, 2024

OpenVINO Version

Name: openvino
Version: 2024.4.0
Summary: OpenVINO(TM) Runtime
Home-page: https://docs.openvino.ai/2023.0/index.html
Author: Intel(R) Corporation
Author-email: [email protected]
License: OSI Approved :: Apache Software License
Location: /root/openvino_env/lib/python3.12/site-packages
Requires: numpy, openvino-telemetry, packaging
Required-by: openvino-tokenizers

Operating System

Ubuntu 24.04 LTS  Linux ubuntu 6.8.0-48-generic

Device used for inference

NPU

Framework

None

Model used

TinyLlama

Issue description

Refer to official documentation:
https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html

This is my hardware information
[screenshot: hardware information]

import openvino_genai as ov_genai
help(ov_genai.LLMPipeline)
[screenshot: help(ov_genai.LLMPipeline) output]
As shown in the figure above, the device list does not include NPU. I followed the instructions in the document and changed the device to NPU, but the result was empty. Changing it to CPU or GPU restores normal operation. Where did I go wrong?
[screenshots]

There is another question: is there a way to check NPU utilization in an Ubuntu environment, similar to tools like nvidia-smi?

@taikai-zz taikai-zz added bug Something isn't working support_request Support team labels Nov 15, 2024
@ilya-lavrenov ilya-lavrenov transferred this issue from openvinotoolkit/openvino Nov 15, 2024
@ilya-lavrenov ilya-lavrenov assigned l-bat and TolyaTalamanov and unassigned l-bat Nov 15, 2024
@ilya-lavrenov ilya-lavrenov added category: LLM LLM pipeline (stateful, static) category: NPU labels Nov 15, 2024
@Wan-Intel

Did you encounter the issue when using the latest version of OpenVINO™ GenAI?

You may use the latest OpenVINO™ GenAI and run on NPU by following the steps in Run LLMs with OpenVINO GenAI Flavor on NPU.

@helena-intel
Contributor

In addition to using the latest OpenVINO GenAI, if you haven't exported the model with --sym, please try exporting the model with that option. This should work:

optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0

Also note that for NPU, you should add do_sample=False to the pipe.generate() call. See the documentation for more limitations/recommendations.
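
A minimal sketch of what that looks like in Python; the model path and prompt below are placeholders, not values taken from this issue:

import openvino_genai as ov_genai

# Load the int4 --sym model exported above and target the NPU plugin.
pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "NPU")

# do_sample=False selects greedy decoding, as recommended above for NPU.
print(pipe.generate("What is OpenVINO?", max_new_tokens=100, do_sample=False))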

@taikai-zz
Author

taikai-zz commented Dec 4, 2024

optimum-cli export openvino -m Qwen/Qwen2-7B --weight-format int4 --sym --ratio 1.0 --group-size 128 Qwen2-7B
[screenshots]

@helena-intel
Contributor

Unfortunately, there is an issue with using NPU for LLMs on Ubuntu. The NPU team is working on it; the issue is not with OpenVINO, but on the kernel/driver level. I am sorry you're running into this. We will keep you informed.

@dmatveev
Contributor

dmatveev commented Dec 4, 2024

This is a 32GB MTL. The ticket was opened for TinyLlama, but the logs mention group-quantized Qwen2-7B - a completely different league.

@Wan-Intel

Wan-Intel commented Dec 10, 2024

I also encountered the issue, but only when using NPU.
[screenshot: NPU failure]

I'll escalate the case to the relevant team and we'll provide an update as soon as possible.

@Wan-Intel Wan-Intel added the PSE label Dec 10, 2024
@helena-intel
Contributor

@taikai-zz a new NPU driver was released today with a fix for LLM on LNL: https://github.com/intel/linux-npu-driver/releases/tag/v1.10.1 Could you check if that fixes the issue for you? We also had a new openvino-genai release this week, 2024.6, with performance improvements on NPU, so please upgrade with pip install --upgrade openvino-genai.

Also note that for running larger LLMs (>4B) you should use per-channel quantization. This note will be added to the docs too; I'm mentioning it here because I see you're using a 7B model. Instead of group-size 128, you should specify group-size -1 (note the minus sign). This is an example from the docs for Llama-2-7b:

optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset=wikitext2 Llama-2-7b-chat-hf

@taikai-zz
Author

Thank you for your help. It has now returned to normal, but the speed is too slow.
[screenshot]

There is an error in the document; please fix it:
[screenshot]

@helena-intel
Contributor

I'm glad to hear the issue is fixed! For faster speed, please see the document (the same one you screenshotted) about model caching. That will speed up model loading time. Since model loading only occurs once, it is also useful to measure inference time by adding start = time.perf_counter() before and end = time.perf_counter() after pipe.generate(), and then showing the duration with print(end - start).
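
A minimal sketch of that measurement, assuming the same pipeline setup as above; the CACHE_DIR property and path follow the caching section of the guide and are just an example here:

import time
import openvino_genai as ov_genai

# Enable model caching (see the guide) so later loads are faster; the cache path is an example.
pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "NPU", CACHE_DIR=".npucache")

# Time only the generation, separately from the one-time model load.
start = time.perf_counter()
result = pipe.generate("What is OpenVINO?", max_new_tokens=100, do_sample=False)
end = time.perf_counter()

print(result)
print(end - start)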

The group_size -1 setting enables channel-wise quantization; your screenshot is from the group quantization tab. Also note that I recommended this for larger models; for the 1.1B model, group quantization will work fine too. As I mentioned, this will be clarified in the docs.
