diff --git a/docs/articles_en/learn-openvino.rst b/docs/articles_en/learn-openvino.rst index 4fca64051003a7..98797c9c67c126 100644 --- a/docs/articles_en/learn-openvino.rst +++ b/docs/articles_en/learn-openvino.rst @@ -14,7 +14,7 @@ Learn OpenVINO Interactive Tutorials (Python) Sample Applications (Python & C++) - Large Language Model Inference Guide + Generative AI workflow @@ -29,5 +29,5 @@ as well as an experienced user. | :doc:`OpenVINO Samples ` | The OpenVINO samples (Python and C++) are simple console applications that show how to use specific OpenVINO API features. They can assist you in executing tasks such as loading a model, running inference, querying particular device capabilities, etc. -| :doc:`Large Language Models in OpenVINO ` +| :doc:`Generative AI workflow ` | Detailed information on how OpenVINO accelerates Generative AI use cases and what models it supports. This tutorial provides instructions for running Generative AI models using Hugging Face Optimum Intel and Native OpenVINO APIs. diff --git a/docs/articles_en/learn-openvino/llm_inference_guide.rst b/docs/articles_en/learn-openvino/llm_inference_guide.rst index 36c001c015f744..bfc4f9b4c49173 100644 --- a/docs/articles_en/learn-openvino/llm_inference_guide.rst +++ b/docs/articles_en/learn-openvino/llm_inference_guide.rst @@ -1,140 +1,106 @@ -Large Language Model Inference Guide +Generative AI workflow ======================================== .. meta:: - :description: Explore learning materials, including interactive - Python tutorials and sample console applications that explain - how to use OpenVINO features. + :description: learn how to use OpenVINO to run generative AI models. .. toctree:: :maxdepth: 1 :hidden: - Run LLMs with Optimum Intel - Run LLMs on OpenVINO GenAI Flavor - Run LLMs on Base OpenVINO + Inference with OpenVINO GenAI + Inference with Optimum Intel + Generative AI with Base OpenVINO (not recommended) OpenVINO Tokenizers -Large Language Models (LLMs) like GPT are transformative deep learning networks capable of a -broad range of natural language tasks, from text generation to language translation. OpenVINO -optimizes the deployment of these models, enhancing their performance and integration into -various applications. This guide shows how to use LLMs with OpenVINO, from model loading and -conversion to advanced use cases. + + +Generative AI is a specific area of Deep Learning models used for producing new and “original” +data, based on input in the form of image, sound, or natural language text. Due to their +complexity and size, generative AI pipelines are more difficult to deploy and run efficiently. +OpenVINO simplifies the process and ensures high-performance integrations, with the following +options: + +.. tab-set:: + + .. tab-item:: OpenVINO GenAI + + | - Suggested for production deployment for the supported use cases. + | - Smaller footprint and fewer dependencies. + | - More optimization and customization options. + | - Available in both Python and C++. + | - A limited set of supported use cases. + + :doc:`Install the OpenVINO GenAI package <../get-started/install-openvino/install-openvino-genai>` + and run generative models out of the box. With custom + API and tokenizers, among other components, it manages the essential tasks such as the + text generation loop, tokenization, and scheduling, offering ease of use and high + performance. + + .. tab-item:: Hugging Face integration + + | - Suggested for prototyping and, if the use case is not covered by OpenVINO GenAI, production. 
+ | - Bigger footprint and more dependencies. + | - Limited customization due to Hugging Face dependency. + | - Not usable for C++ applications. + | - A very wide range of supported models. + + Using Optimum Intel is a great way to experiment with different models and scenarios, + thanks to a simple interface for the popular API and infrastructure offered by Hugging Face. + It also enables weight compression with + `Neural Network Compression Framework (NNCF) `__, + as well as conversion on the fly. For integration with the final product it may offer + lower performance, though. + +`Check out the GenAI Quick-start Guide [PDF] `__ The advantages of using OpenVINO for LLM deployment: -* **OpenVINO offers optimized LLM inference**: - provides a full C/C++ API, leading to faster operation than Python-based runtimes; includes a - Python API for rapid development, with the option for further optimization in C++. -* **Compatible with diverse hardware**: - supports CPUs, GPUs, and neural accelerators across ARM and x86/x64 architectures, integrated - Intel® Processor Graphics, discrete Intel® Arc™ A-Series Graphics, and discrete Intel® Data - Center GPU Flex Series; features automated optimization to maximize performance on target - hardware. -* **Requires fewer dependencies**: - than frameworks like Hugging Face and PyTorch, resulting in a smaller binary size and reduced - memory footprint, making deployments easier and updates more manageable. -* **Provides compression and precision management techniques**: - such as 8-bit and 4-bit weight compression, including embedding layers, and storage format - reduction. This includes fp16 precision for non-compressed models and int8/int4 for compressed - models, like GPTQ models from `Hugging Face `__. -* **Supports a wide range of deep learning models and architectures**: - including text, image, and audio generative models like Llama 2, MPT, OPT, Stable Diffusion, - Stable Diffusion XL. This enables the development of multimodal applications, allowing for - write-once, deploy-anywhere capabilities. -* **Enhances inference capabilities**: - fused inference primitives such as Scaled Dot Product Attention, Rotary Positional Embedding, - Group Query Attention, and Mixture of Experts. It also offers advanced features like in-place - KV-cache, dynamic quantization, KV-cache quantization and encapsulation, dynamic beam size - configuration, and speculative sampling. -* **Provides stateful model optimization**: - models from the Hugging Face Transformers are converted into a stateful form, optimizing - inference performance and memory usage in long-running text generation tasks by managing past - KV-cache tensors more efficiently internally. This feature is automatically activated for many - supported models, while unsupported ones remain stateless. Learn more about the - :doc:`Stateful models and State API <../openvino-workflow/running-inference/stateful-models>`. - -OpenVINO offers three main paths for Generative AI use cases: - -* **Hugging Face**: use OpenVINO as a backend for Hugging Face frameworks (transformers, - diffusers) through the `Optimum Intel `__ - extension. -* **OpenVINO GenAI Flavor**: use OpenVINO GenAI APIs (Python and C++). -* **Base OpenVINO**: use OpenVINO native APIs (Python and C++) with - `custom pipeline code `__. - -In both cases, the OpenVINO runtime is used for inference, and OpenVINO tools are used for -optimization. The main differences are in footprint size, ease of use, and customizability. 
- -The Hugging Face API is easy to learn, provides a simple interface and hides the complexity of -model initialization and text generation for a better developer experience. However, it has more -dependencies, less customization, and cannot be ported to C/C++. - -The OpenVINO GenAI Flavor reduces the complexity of LLMs implementation by -automatically managing essential tasks like the text generation loop, tokenization, -and scheduling. The Native OpenVINO API provides a more hands-on experience, -requiring manual setup of these functions. Both methods are designed to minimize dependencies -and the overall application footprint and enable the use of generative models in C++ applications. - -It is recommended to start with Hugging Face frameworks to experiment with different models and -scenarios. Then the model can be used with OpenVINO APIs if it needs to be optimized -further. Optimum Intel provides interfaces that enable model optimization (weight compression) -using `Neural Network Compression Framework (NNCF) `__, -and export models to the OpenVINO model format for use in native API applications. - -Proceed to run LLMs with: +.. dropdown:: Fewer dependencies and smaller footprint + :animate: fade-in-slide-down + :color: secondary + + Less bloated than frameworks such as Hugging Face and PyTorch, with a smaller binary size and reduced + memory footprint, makes deployments easier and updates more manageable. + +.. dropdown:: Compression and precision management + :animate: fade-in-slide-down + :color: secondary + + Techniques such as 8-bit and 4-bit weight compression, including embedding layers, and storage + format reduction. This includes fp16 precision for non-compressed models and int8/int4 for + compressed models, like GPTQ models from `Hugging Face `__. + +.. dropdown:: Enhanced inference capabilities + :animate: fade-in-slide-down + :color: secondary + + Advanced features like in-place KV-cache, dynamic quantization, KV-cache quantization and + encapsulation, dynamic beam size configuration, and speculative sampling, and more are + available. + +.. dropdown:: Stateful model optimization + :animate: fade-in-slide-down + :color: secondary + + Models from the Hugging Face Transformers are converted into a stateful form, optimizing + inference performance and memory usage in long-running text generation tasks by managing past + KV-cache tensors more efficiently internally. This feature is automatically activated for + many supported models, while unsupported ones remain stateless. Learn more about the + :doc:`Stateful models and State API <../openvino-workflow/running-inference/stateful-models>`. + +.. dropdown:: Optimized LLM inference + :animate: fade-in-slide-down + :color: secondary + + Includes a Python API for rapid development and C++ for further optimization, offering + better performance than Python-based runtimes. + + +Proceed to guides on: -* :doc:`Hugging Face and Optimum Intel <./llm_inference_guide/llm-inference-hf>` * :doc:`OpenVINO GenAI Flavor <./llm_inference_guide/genai-guide>` -* :doc:`Native OpenVINO API <./llm_inference_guide/llm-inference-native-ov>` - -The table below summarizes the differences between Hugging Face and the native OpenVINO API -approaches. - -.. dropdown:: Differences between Hugging Face and the native OpenVINO API - - .. 
list-table:: - :widths: 20 25 55 - :header-rows: 1 - - * - - - Hugging Face through OpenVINO - - OpenVINO Native API - * - Model support - - Supports transformer-based models such as LLMs - - Supports all model architectures from most frameworks - * - APIs - - Python (Hugging Face API) - - Python, C++ (OpenVINO API) - * - Model Format - - Source Framework / OpenVINO - - Source Framework / OpenVINO - * - Inference code - - Hugging Face based - - Custom inference pipelines - * - Additional dependencies - - Many Hugging Face dependencies - - Lightweight (e.g. numpy, etc.) - * - Application footprint - - Large - - Small - * - Pre/post-processing and glue code - - Provided through high-level Hugging Face APIs - - Must be custom implemented (see OpenVINO samples and notebooks) - * - Performance - - Good, but less efficient compared to native APIs - - Inherent speed advantage with C++, but requires hands-on optimization - * - Flexibility - - Constrained to Hugging Face API - - High flexibility with Python and C++; allows custom coding - * - Learning Curve and Effort - - Lower learning curve; quick to integrate - - Higher learning curve; requires more effort in integration - * - Ideal Use Case - - Ideal for quick prototyping and Python-centric projects - - Best suited for high-performance, resource-optimized production environments - * - Model Serving - - Paid service, based on CPU/GPU usage with Hugging Face - - Free code solution, run script for own server; costs may incur for cloud services - like AWS but generally cheaper than Hugging Face rates +* :doc:`Hugging Face and Optimum Intel <./llm_inference_guide/llm-inference-hf>` + + diff --git a/docs/articles_en/learn-openvino/llm_inference_guide/genai-guide-npu.rst b/docs/articles_en/learn-openvino/llm_inference_guide/genai-guide-npu.rst index 41e5cbb5733c58..d725b306d57908 100644 --- a/docs/articles_en/learn-openvino/llm_inference_guide/genai-guide-npu.rst +++ b/docs/articles_en/learn-openvino/llm_inference_guide/genai-guide-npu.rst @@ -1,4 +1,4 @@ -Run LLMs with OpenVINO GenAI Flavor on NPU +Inference with OpenVINO GenAI ========================================== .. meta:: diff --git a/docs/articles_en/learn-openvino/llm_inference_guide/genai-guide.rst b/docs/articles_en/learn-openvino/llm_inference_guide/genai-guide.rst index f18b66915fc3ce..9998b3989486d2 100644 --- a/docs/articles_en/learn-openvino/llm_inference_guide/genai-guide.rst +++ b/docs/articles_en/learn-openvino/llm_inference_guide/genai-guide.rst @@ -1,4 +1,4 @@ -Run LLM Inference on OpenVINO with the GenAI Flavor +Inference with OpenVINO GenAI =============================================================================================== .. meta:: @@ -9,39 +9,326 @@ Run LLM Inference on OpenVINO with the GenAI Flavor :hidden: NPU inference of LLMs - genai-guide/genai-use-cases -This guide will show you how to integrate the OpenVINO GenAI flavor into your application, covering -loading a model and passing the input context to receive generated text. Note that the vanilla flavor of OpenVINO -will not work with these instructions, make sure to -:doc:`install OpenVINO GenAI <../../get-started/install-openvino/install-openvino-genai>`. +This article provides reference code and guidance on running generative AI models, +using OpenVINO GenAI. Note that the base OpenVINO version will not work with these instructions, +make sure to :doc:`install OpenVINO GenAI <../../get-started/install-openvino/install-openvino-genai>`. -.. 
note:: +| Here is sample code for several Generative AI use case scenarios. Note that these are very basic + examples and may need adjustments for your specific needs, like changing the inference device. +| For a more extensive instruction and additional options, see the + `step-by-step chat-bot guide <#chat-bot-use-case-step-by-step>`__ below. - The examples use the CPU as the target device, however, the GPU is also supported. - Note that for the LLM pipeline, the GPU is used only for inference, while token selection, tokenization, and - detokenization remain on the CPU, for efficiency. Tokenizers are represented as a separate model and also run - on the CPU. +.. dropdown:: Text-to-Image Generation -1. Export an LLM model via Hugging Face Optimum-Intel. A chat-tuned TinyLlama model is used in this example: + .. tab-set:: + + .. tab-item:: Python + :sync: python + + .. tab-set:: + + .. tab-item:: main.py + :name: mainpy + + .. code-block:: python + + import openvino_genai + from PIL import Image + import numpy as np + + class Generator(openvino_genai.Generator): + def __init__(self, seed, mu=0.0, sigma=1.0): + openvino_genai.Generator.__init__(self) + np.random.seed(seed) + self.mu = mu + self.sigma = sigma + + def next(self): + return np.random.normal(self.mu, self.sigma) + + + def infer(model_dir: str, prompt: str): + device = 'CPU' # GPU can be used as well + random_generator = Generator(42) + pipe = openvino_genai.Text2ImagePipeline(model_dir, device) + image_tensor = pipe.generate( + prompt, + width=512, + height=512, + num_inference_steps=20, + num_images_per_prompt=1, + random_generator=random_generator + ) + + image = Image.fromarray(image_tensor.data[0]) + image.save("image.bmp") + + .. tab-item:: LoRA.py + :name: lorapy + + .. code-block:: python + + import openvino as ov + import openvino_genai + import numpy as np + import sys + + + class Generator(openvino_genai.Generator): + def __init__(self, seed, mu=0.0, sigma=1.0): + openvino_genai.Generator.__init__(self) + np.random.seed(seed) + self.mu = mu + self.sigma = sigma + + def next(self): + return np.random.normal(self.mu, self.sigma) + + + def image_write(path: str, image_tensor: ov.Tensor): + from PIL import Image + image = Image.fromarray(image_tensor.data[0]) + image.save(path) + + + def infer(models_path: str, prompt: str): + prompt = "cyberpunk cityscape like Tokyo New York with tall buildings at dusk golden hour cinematic lighting" + + device = "CPU" # GPU, NPU can be used as well + adapter_config = openvino_genai.AdapterConfig() + + for i in range(int(len(adapters) / 2)): + adapter = openvino_genai.Adapter(adapters[2 * i]) + alpha = float(adapters[2 * i + 1]) + adapter_config.add(adapter, alpha) + + pipe = openvino_genai.Text2ImagePipeline(models_path, device, adapters=adapter_config) + print("Generating image with LoRA adapters applied, resulting image will be in lora.bmp") + image = pipe.generate(prompt, + random_generator=Generator(42), + width=512, + height=896, + num_inference_steps=20) + + image_write("lora.bmp", image) + print("Generating image without LoRA adapters applied, resulting image will be in baseline.bmp") + image = pipe.generate(prompt, + adapters=openvino_genai.AdapterConfig(), + random_generator=Generator(42), + width=512, + height=896, + num_inference_steps=20 + ) + image_write("baseline.bmp", image) + + For more information, refer to the + `Python sample `__ + + .. tab-item:: C++ + :sync: cpp + + .. tab-set:: + + .. tab-item:: main.cpp + :name: maincpp + + .. 
code-block:: cpp + + #include "openvino/genai/text2image/pipeline.hpp" + + #include "imwrite.hpp" + + int32_t main(int32_t argc, char* argv[]) try { + OPENVINO_ASSERT(argc == 3, "Usage: ", argv[0], " ''"); + + const std::string models_path = argv[1], prompt = argv[2]; + const std::string device = "CPU"; // GPU, NPU can be used as well + + ov::genai::Text2ImagePipeline pipe(models_path, device); + ov::Tensor image = pipe.generate(prompt, + ov::genai::width(512), + ov::genai::height(512), + ov::genai::num_inference_steps(20), + ov::genai::num_images_per_prompt(1)); + + imwrite("image_%d.bmp", image, true); + + return EXIT_SUCCESS; + } catch (const std::exception& error) { + try { + std::cerr << error.what() << '\n'; + } catch (const std::ios_base::failure&) {} + return EXIT_FAILURE; + } catch (...) { + try { + std::cerr << "Non-exception object thrown\n"; + } catch (const std::ios_base::failure&) {} + return EXIT_FAILURE; + } + + .. tab-item:: LoRA.cpp + :name: loracpp + + .. code-block:: cpp + + #include "openvino/genai/text2image/pipeline.hpp" + + #include "imwrite.hpp" + + int32_t main(int32_t argc, char* argv[]) try { + OPENVINO_ASSERT(argc >= 3 && (argc - 3) % 2 == 0, "Usage: ", argv[0], " '' [ ...]]"); + + const std::string models_path = argv[1], prompt = argv[2]; + const std::string device = "CPU"; // GPU, NPU can be used as well - .. code-block:: python + ov::genai::AdapterConfig adapter_config; + for(size_t i = 0; i < (argc - 3)/2; ++i) { + ov::genai::Adapter adapter(argv[3 + 2*i]); + float alpha = std::atof(argv[3 + 2*i + 1]); + adapter_config.add(adapter, alpha); + } - optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --weight-format fp16 --trust-remote-code "TinyLlama-1.1B-Chat-v1.0" + ov::genai::Text2ImagePipeline pipe(models_path, device, ov::genai::adapters(adapter_config)); - *Optional*. Optimize the model: + std::cout << "Generating image with LoRA adapters applied, resulting image will be in lora.bmp\n"; + ov::Tensor image = pipe.generate(prompt, + ov::genai::random_generator(std::make_shared(42)), + ov::genai::width(512), + ov::genai::height(896), + ov::genai::num_inference_steps(20)); + imwrite("lora.bmp", image, true); - The model is an optimized OpenVINO IR with FP16 precision. For enhanced LLM performance, - it is recommended to use lower precision for model weights, such as INT4, and to compress weights - using NNCF during model export directly: + std::cout << "Generating image without LoRA adapters applied, resulting image will be in baseline.bmp\n"; + image = pipe.generate(prompt, + ov::genai::adapters(), + ov::genai::random_generator(std::make_shared(42)), + ov::genai::width(512), + ov::genai::height(896), + ov::genai::num_inference_steps(20)); + imwrite("baseline.bmp", image, true); - .. code-block:: python + return EXIT_SUCCESS; + } catch (const std::exception& error) { + try { + std::cerr << error.what() << '\n'; + } catch (const std::ios_base::failure&) {} + return EXIT_FAILURE; + } catch (...) { + try { + std::cerr << "Non-exception object thrown\n"; + } catch (const std::ios_base::failure&) {} + return EXIT_FAILURE; + } - optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --weight-format int4 --trust-remote-code "TinyLlama-1.1B-Chat-v1.0" + For more information, refer to the + `C++ sample `__ -2. Perform generation using the new GenAI API: +.. dropdown:: Speech Recognition + + The application performs inference on speech recognition Whisper Models. 
The samples include + the ``WhisperPipeline`` class and use audio files in WAV format at a sampling rate of 16 kHz + as input. + + .. tab-set:: + + .. tab-item:: Python + :sync: cpp + + .. code-block:: python + + import openvino_genai + import librosa + + + def read_wav(filepath): + raw_speech, samplerate = librosa.load(filepath, sr=16000) + return raw_speech.tolist() + + + def infer(model_dir: str, wav_file_path: str): + device = "CPU" # GPU or NPU can be used as well. + pipe = openvino_genai.WhisperPipeline(model_dir, device) + + # The pipeline expects normalized audio with a sampling rate of 16kHz. + raw_speech = read_wav(wav_file_path) + result = pipe.generate( + raw_speech, + max_new_tokens=100, + language="<|en|>", + task="transcribe", + return_timestamps=True, + ) + + print(result) + + for chunk in result.chunks: + print(f"timestamps: [{chunk.start_ts}, {chunk.end_ts}] text: {chunk.text}") + + + For more information, refer to the + `Python sample `__. + + .. tab-item:: C++ + :sync: cpp + + .. code-block:: cpp + + #include "audio_utils.hpp" + #include "openvino/genai/whisper_pipeline.hpp" + + int main(int argc, char* argv[]) try { + if (3 > argc) { + throw std::runtime_error(std::string{"Usage: "} + argv[0] + " \"\""); + } + + std::filesystem::path models_path = argv[1]; + std::string wav_file_path = argv[2]; + std::string device = "CPU"; // GPU or NPU can be used as well. + + ov::genai::WhisperPipeline pipeline(models_path, device); + + ov::genai::WhisperGenerationConfig config(models_path / "generation_config.json"); + config.max_new_tokens = 100; + config.language = "<|en|>"; + config.task = "transcribe"; + config.return_timestamps = true; + + // The pipeline expects normalized audio with a sampling rate of 16kHz. + ov::genai::RawSpeechInput raw_speech = utils::audio::read_wav(wav_file_path); + auto result = pipeline.generate(raw_speech, config); + + std::cout << result << "\n"; + + for (auto& chunk : *result.chunks) { + std::cout << "timestamps: [" << chunk.start_ts << ", " << chunk.end_ts << "] text: " << chunk.text << "\n"; + } + + } catch (const std::exception& error) { + try { + std::cerr << error.what() << '\n'; + } catch (const std::ios_base::failure&) { + } + return EXIT_FAILURE; + } catch (...) { + try { + std::cerr << "Non-exception object thrown\n"; + } catch (const std::ios_base::failure&) { + } + return EXIT_FAILURE; + } + + For more information, refer to the + `C++ sample `__. + + +.. dropdown:: Using GenAI in Chat Scenario + + For chat scenarios where inputs and outputs represent a conversation, maintaining KVCache + across inputs may prove beneficial. The ``start_chat`` and ``finish_chat`` chat-specific + methods are used to mark a conversation session, as shown in the samples below: .. tab-set:: @@ -50,9 +337,35 @@ will not work with these instructions, make sure to .. code-block:: python - import openvino_genai as ov_genai - pipe = ov_genai.LLMPipeline(model_path, "CPU") - print(pipe.generate("The Sun is yellow because", max_new_tokens=100)) + import openvino_genai + + + def streamer(subword): + print(subword, end='', flush=True) + return False + + + def infer(model_dir: str): + device = 'CPU' # GPU can be used as well. 
+ pipe = openvino_genai.LLMPipeline(model_dir, device) + + config = openvino_genai.GenerationConfig() + config.max_new_tokens = 100 + + pipe.start_chat() + while True: + try: + prompt = input('question:\n') + except EOFError: + break + pipe.generate(prompt, config, streamer) + print('\n----------') + pipe.finish_chat() + + + + For more information, refer to the + `Python sample `__. .. tab-item:: C++ :sync: cpp @@ -60,27 +373,250 @@ will not work with these instructions, make sure to .. code-block:: cpp #include "openvino/genai/llm_pipeline.hpp" - #include - int main(int argc, char* argv[]) { - std::string model_path = argv[1]; - ov::genai::LLMPipeline pipe(model_path, "CPU"); - std::cout << pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(100)); + int main(int argc, char* argv[]) try { + if (2 != argc) { + throw std::runtime_error(std::string{"Usage: "} + argv[0] + " "); + } + std::string prompt; + std::string models_path = argv[1]; + + std::string device = "CPU"; // GPU, NPU can be used as well + ov::genai::LLMPipeline pipe(models_path, device); + + ov::genai::GenerationConfig config; + config.max_new_tokens = 100; + std::function streamer = [](std::string word) { + std::cout << word << std::flush; + return false; + }; + + pipe.start_chat(); + std::cout << "question:\n"; + while (std::getline(std::cin, prompt)) { + pipe.generate(prompt, config, streamer); + std::cout << "\n----------\n" + "question:\n"; + } + pipe.finish_chat(); + } catch (const std::exception& error) { + try { + std::cerr << error.what() << '\n'; + } catch (const std::ios_base::failure&) {} + return EXIT_FAILURE; + } catch (...) { + try { + std::cerr << "Non-exception object thrown\n"; + } catch (const std::ios_base::failure&) {} + return EXIT_FAILURE; } -The `LLMPipeline` is the main object used for decoding. You can construct it directly from the -folder with the converted model. It will automatically load the main model, tokenizer, detokenizer, -and the default generation configuration. -Once the model is exported from Hugging Face Optimum-Intel, it already contains all the information -necessary for execution, including the tokenizer/detokenizer and the generation config, ensuring that -its results match those generated by Hugging Face. + For more information, refer to the + `C++ sample `__ + + +.. dropdown:: Using GenAI with Vision Language Models + + OpenVINO GenAI introduces the ``openvino_genai.VLMPipeline`` pipeline for + inference of multimodal text-generation Vision Language Models (VLMs). + With a text prompt and an image as input, VLMPipeline can generate text using + models such as LLava or MiniCPM-V. See the chat scenario presented + in the samples below: + + .. tab-set:: + + .. tab-item:: Python + :sync: py + + .. code-block:: python + + import numpy as np + import openvino_genai + from PIL import Image + from openvino import Tensor + from pathlib import Path + + + def streamer(subword: str) -> bool: + print(subword, end='', flush=True) + + + def read_image(path: str) -> Tensor: + pic = Image.open(path).convert("RGB") + image_data = np.array(pic.getdata()).reshape(1, pic.size[1], pic.size[0], 3).astype(np.uint8) + return Tensor(image_data) + + + def read_images(path: str) -> list[Tensor]: + entry = Path(path) + if entry.is_dir(): + return [read_image(str(file)) for file in sorted(entry.iterdir())] + return [read_image(path)] + + + def infer(model_dir: str, image_dir: str): + rgbs = read_images(image_dir) + device = 'CPU' # GPU can be used as well. 
+ enable_compile_cache = dict() + if "GPU" == device: + enable_compile_cache["CACHE_DIR"] = "vlm_cache" + pipe = openvino_genai.VLMPipeline(model_dir, device, **enable_compile_cache) + + config = openvino_genai.GenerationConfig() + config.max_new_tokens = 100 + + pipe.start_chat() + prompt = input('question:\n') + pipe.generate(prompt, images=rgbs, generation_config=config, streamer=streamer) + + while True: + try: + prompt = input("\n----------\n" + "question:\n") + except EOFError: + break + pipe.generate(prompt, generation_config=config, streamer=streamer) + pipe.finish_chat() + + + For more information, refer to the + `Python sample `__. + + .. tab-item:: C++ + :sync: cpp + + .. code-block:: cpp + + #include "load_image.hpp" + #include + #include + + bool print_subword(std::string&& subword) { + return !(std::cout << subword << std::flush); + } + + int main(int argc, char* argv[]) try { + if (3 != argc) { + throw std::runtime_error(std::string{"Usage "} + argv[0] + " "); + } + + std::vector rgbs = utils::load_images(argv[2]); + + std::string device = "CPU"; // GPU can be used as well. + ov::AnyMap enable_compile_cache; + if ("GPU" == device) { + enable_compile_cache.insert({ov::cache_dir("vlm_cache")}); + } + ov::genai::VLMPipeline pipe(argv[1], device, enable_compile_cache); + + ov::genai::GenerationConfig generation_config; + generation_config.max_new_tokens = 100; + + std::string prompt; + + pipe.start_chat(); + std::cout << "question:\n"; + + std::getline(std::cin, prompt); + pipe.generate(prompt, + ov::genai::images(rgbs), + ov::genai::generation_config(generation_config), + ov::genai::streamer(print_subword)); + std::cout << "\n----------\n" + "question:\n"; + while (std::getline(std::cin, prompt)) { + pipe.generate(prompt, + ov::genai::generation_config(generation_config), + ov::genai::streamer(print_subword)); + std::cout << "\n----------\n" + "question:\n"; + } + pipe.finish_chat(); + } catch (const std::exception& error) { + try { + std::cerr << error.what() << '\n'; + } catch (const std::ios_base::failure&) {} + return EXIT_FAILURE; + } catch (...) { + try { + std::cerr << "Non-exception object thrown\n"; + } catch (const std::ios_base::failure&) {} + return EXIT_FAILURE; + } + + + For more information, refer to the + `C++ sample `__ + + +| + + +Chat-bot use case - step by step +############################################################################################### + +This example will show you how to create a chat-bot functionality, using the ``ov_genai.LLMPipeline`` +and a chat-tuned TinyLlama model. Apart from the basic implementation, it provides additional +optimization methods. + +Although CPU is used as inference device in the samples below, you may choose GPU instead. +Note that tasks such as token selection, tokenization, and detokenization are always handled +by CPU only. Tokenizers, represented as a separate model, are also run on CPU. + +Running the model ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +You start with exporting an LLM model via Hugging Face Optimum-Intel. Note that the precision +of ``int4`` is used, instead of the original ``fp16``, for better performance. The weight +compression is done by NNCF at the model export stage. The exported model contains all the +information necessary for execution, including the tokenizer/detokenizer and the generation +config, ensuring that its results match those generated by Hugging Face. 
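+
+The same conversion and NNCF weight compression step can also be scripted through the Optimum
+Intel Python API instead of the command line. The snippet below is only a rough sketch of that
+step, assuming ``optimum-intel`` with its OpenVINO extras and NNCF are installed; depending on
+the version, the ``optimum-cli`` command shown below may still be needed to export the OpenVINO
+tokenizer and detokenizer models alongside the LLM.
+
+.. code-block:: python
+
+   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
+
+   model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
+
+   # Convert the Hugging Face model to OpenVINO IR and compress weights to INT4 with NNCF.
+   model = OVModelForCausalLM.from_pretrained(
+       model_id,
+       export=True,
+       quantization_config=OVWeightQuantizationConfig(bits=4),
+   )
+   model.save_pretrained("TinyLlama-1.1B-Chat-v1.0")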
+ +The `LLMPipeline` is the main object used for decoding and handles all the necessary steps. +You can construct it directly from the folder with the converted model. + + +.. tab-set:: + + .. tab-item:: Python + :sync: py + + .. code-block:: console + + optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --weight-format int4 --trust-remote-code "TinyLlama-1.1B-Chat-v1.0" + + .. code-block:: python + + import openvino_genai as ov_genai + pipe = ov_genai.LLMPipeline(model_path, "CPU") + print(pipe.generate("The Sun is yellow because", max_new_tokens=100)) + + .. tab-item:: C++ + :sync: cpp + + .. code-block:: console + + optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --weight-format int4 --trust-remote-code "TinyLlama-1.1B-Chat-v1.0" + + .. code-block:: cpp + + #include "openvino/genai/llm_pipeline.hpp" + #include + + int main(int argc, char* argv[]) { + std::string model_path = argv[1]; + ov::genai::LLMPipeline pipe(model_path, "CPU"); + std::cout << pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(100)); + } + + Streaming the Output -########################### ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -For more interactive UIs during generation, streaming of model output tokens is supported. See the example -below, where a lambda function outputs words to the console immediately upon generation: +For more interactive UIs during generation, you can stream output tokens. In this example, a +lambda function outputs words to the console immediately upon generation: .. tab-set:: @@ -177,12 +713,10 @@ You can also create your custom streamer for more sophisticated processing: Optimizing Generation with Grouped Beam Search -####################################################### - -Leverage grouped beam search decoding and configure generation_config for better text generation -quality and efficient batch processing in GenAI applications. ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ -Specify generation_config to use grouped beam search: +For better text generation quality and more efficient batch processing, specify +``generation_config`` to leverage grouped beam search decoding. .. tab-set:: @@ -218,22 +752,19 @@ Specify generation_config to use grouped beam search: cout << pipe.generate("The Sun is yellow because", config); } + Efficient Text Generation via Speculative Decoding -################################################## +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Speculative decoding (or assisted-generation) enables faster token generation -when an additional smaller draft model is used alongside the main model. -The draft model predicts the next K tokens one by one in an autoregressive manner, -while the main model validates these predictions and corrects them if necessary. +when an additional smaller draft model is used alongside the main model. This reduces the +number of infer requests to the main model, increasing performance. -Each predicted token is compared, and when there is a difference between the draft and -main model, the last token predicted by the main model is kept. Then, the draft -model acquires this token and tries prediction of the next K tokens, -thus repeating the cycle. +The draft model predicts the next K tokens one by one in an autoregressive manner. 
The main +model validates these predictions and corrects them if necessary - in case of +a discrepancy, the main model prediction is used. Then, the draft model acquires this token and +runs prediction of the next K tokens, thus repeating the cycle. -This method eliminates the need for multiple infer requests to the main model, -which results in increased performance. Its implementation in the pipeline is -shown in the code samples below: .. tab-set:: @@ -265,7 +796,7 @@ shown in the code samples below: config.max_new_tokens = 100 config.num_assistant_tokens = 5 - pipe.generate(prompt, config, streamer) + pipe.generate("The Sun is yellow because", config, streamer) For more information, refer to the @@ -310,7 +841,7 @@ shown in the code samples below: return false; }; - pipe.generate(prompt, config, streamer); + pipe.generate("The Sun is yellow because", config, streamer); } catch (const std::exception& error) { try { std::cerr << error.what() << '\n'; @@ -327,10 +858,18 @@ shown in the code samples below: For more information, refer to the `C++ sample `__ + + + + + + + Comparing with Hugging Face Results ####################################### -Compare and analyze results with those generated by Hugging Face models. +You can compare the results of the above example with those generated by Hugging Face models by +running the following code: .. tab-set:: @@ -358,30 +897,35 @@ Compare and analyze results with those generated by Hugging Face models. assert hf_output == ov_output -GenAI API -####################################### -OpenVINO GenAI Flavor includes the following API: -* generation_config - defines a configuration class for text generation, enabling customization of the generation process such as the maximum length of the generated text, whether to ignore end-of-sentence tokens, and the specifics of the decoding strategy (greedy, beam search, or multinomial sampling). -* llm_pipeline - provides classes and utilities for text generation, including a pipeline for processing inputs, generating text, and managing outputs with configurable options. -* streamer_base - an abstract base class for creating streamers. -* tokenizer - the tokenizer class for text encoding and decoding. +GenAI API +####################################### +The use case described here uses the following OpenVINO GenAI API methods: + +* generation_config - defines a configuration class for text generation, + enabling customization of the generation process such as the maximum length of + the generated text, whether to ignore end-of-sentence tokens, and the specifics + of the decoding strategy (greedy, beam search, or multinomial sampling). +* llm_pipeline - provides classes and utilities for processing inputs, + text generation, and managing outputs with configurable options. +* streamer_base - an abstract base class for creating streamers. +* tokenizer - the tokenizer class for text encoding and decoding. * visibility - controls the visibility of the GenAI library. -Learn more in the `GenAI API reference `__. +Learn more from the `GenAI API reference `__. 
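+
+As an illustration only, the sketch below shows how these pieces can fit together. It assumes
+the Python names used earlier in this article (``LLMPipeline``, ``GenerationConfig``) plus the
+``StreamerBase`` class and the tokenizer returned by the pipeline:
+
+.. code-block:: python
+
+   import openvino_genai as ov_genai
+
+   class PrintStreamer(ov_genai.StreamerBase):
+       # A minimal custom streamer: decodes tokens as they arrive and prints them.
+       def __init__(self, tokenizer):
+           super().__init__()
+           self.tokenizer = tokenizer
+
+       def put(self, token_id) -> bool:
+           print(self.tokenizer.decode([token_id]), end="", flush=True)
+           return False  # False means generation continues.
+
+       def end(self):
+           print()
+
+   # llm_pipeline: loads the converted model folder and runs the text generation loop.
+   pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0", "CPU")
+
+   # generation_config: controls decoding, for example the number of new tokens.
+   config = ov_genai.GenerationConfig()
+   config.max_new_tokens = 100
+
+   # streamer_base / tokenizer: stream the decoded output while it is being generated.
+   pipe.generate("The Sun is yellow because", config, PrintStreamer(pipe.get_tokenizer()))
+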
Additional Resources #################### * `OpenVINO GenAI Repo `__ * `OpenVINO GenAI Samples `__ +* A Jupyter notebook demonstrating + `Visual-language assistant with MiniCPM-V2 and OpenVINO `__ * `OpenVINO Tokenizers `__ * `Neural Network Compression Framework `__ - - diff --git a/docs/articles_en/learn-openvino/llm_inference_guide/genai-guide/genai-use-cases.rst b/docs/articles_en/learn-openvino/llm_inference_guide/genai-guide/genai-use-cases.rst deleted file mode 100644 index 245a2648aab491..00000000000000 --- a/docs/articles_en/learn-openvino/llm_inference_guide/genai-guide/genai-use-cases.rst +++ /dev/null @@ -1,563 +0,0 @@ -GenAI Use Cases -===================== - -This article provides several use case scenarios for Generative AI model -inference. The applications presented in the code samples below -only require minimal configuration, like setting an inference device. Feel free -to explore and modify the source code as you need. - - -Using GenAI for Text-to-Image Generation -######################################## - -Examples below demonstrate inference on text-to-image models, like Stable Diffusion -1.5, 2.1, and LCM, with a text prompt as input. The :ref:`main.cpp ` -sample shows basic usage of the ``Text2ImagePipeline`` pipeline. -:ref:`lora.cpp ` shows how to apply LoRA adapters to the pipeline. - - -.. tab-set:: - - .. tab-item:: Python - :sync: python - - .. tab-set:: - - .. tab-item:: main.py - :name: mainpy - - .. code-block:: python - - import openvino_genai - from PIL import Image - import numpy as np - - class Generator(openvino_genai.Generator): - def __init__(self, seed, mu=0.0, sigma=1.0): - openvino_genai.Generator.__init__(self) - np.random.seed(seed) - self.mu = mu - self.sigma = sigma - - def next(self): - return np.random.normal(self.mu, self.sigma) - - - def infer(model_dir: str, prompt: str): - device = 'CPU' # GPU can be used as well - random_generator = Generator(42) - pipe = openvino_genai.Text2ImagePipeline(model_dir, device) - image_tensor = pipe.generate( - prompt, - width=512, - height=512, - num_inference_steps=20, - num_images_per_prompt=1, - random_generator=random_generator - ) - - image = Image.fromarray(image_tensor.data[0]) - image.save("image.bmp") - - .. tab-item:: LoRA.py - :name: lorapy - - .. 
code-block:: python - - import openvino as ov - import openvino_genai - import numpy as np - import sys - - - class Generator(openvino_genai.Generator): - def __init__(self, seed, mu=0.0, sigma=1.0): - openvino_genai.Generator.__init__(self) - np.random.seed(seed) - self.mu = mu - self.sigma = sigma - - def next(self): - return np.random.normal(self.mu, self.sigma) - - - def image_write(path: str, image_tensor: ov.Tensor): - from PIL import Image - image = Image.fromarray(image_tensor.data[0]) - image.save(path) - - - def infer(models_path: str, prompt: str): - prompt = "cyberpunk cityscape like Tokyo New York with tall buildings at dusk golden hour cinematic lighting" - - device = "CPU" # GPU, NPU can be used as well - adapter_config = openvino_genai.AdapterConfig() - - for i in range(int(len(adapters) / 2)): - adapter = openvino_genai.Adapter(adapters[2 * i]) - alpha = float(adapters[2 * i + 1]) - adapter_config.add(adapter, alpha) - - pipe = openvino_genai.Text2ImagePipeline(models_path, device, adapters=adapter_config) - print("Generating image with LoRA adapters applied, resulting image will be in lora.bmp") - image = pipe.generate(prompt, - random_generator=Generator(42), - width=512, - height=896, - num_inference_steps=20) - - image_write("lora.bmp", image) - print("Generating image without LoRA adapters applied, resulting image will be in baseline.bmp") - image = pipe.generate(prompt, - adapters=openvino_genai.AdapterConfig(), - random_generator=Generator(42), - width=512, - height=896, - num_inference_steps=20 - ) - image_write("baseline.bmp", image) - - For more information, refer to the - `Python sample `__ - - .. tab-item:: C++ - :sync: cpp - - .. tab-set:: - - .. tab-item:: main.cpp - :name: maincpp - - .. code-block:: cpp - - #include "openvino/genai/text2image/pipeline.hpp" - - #include "imwrite.hpp" - - int32_t main(int32_t argc, char* argv[]) try { - OPENVINO_ASSERT(argc == 3, "Usage: ", argv[0], " ''"); - - const std::string models_path = argv[1], prompt = argv[2]; - const std::string device = "CPU"; // GPU, NPU can be used as well - - ov::genai::Text2ImagePipeline pipe(models_path, device); - ov::Tensor image = pipe.generate(prompt, - ov::genai::width(512), - ov::genai::height(512), - ov::genai::num_inference_steps(20), - ov::genai::num_images_per_prompt(1)); - - imwrite("image_%d.bmp", image, true); - - return EXIT_SUCCESS; - } catch (const std::exception& error) { - try { - std::cerr << error.what() << '\n'; - } catch (const std::ios_base::failure&) {} - return EXIT_FAILURE; - } catch (...) { - try { - std::cerr << "Non-exception object thrown\n"; - } catch (const std::ios_base::failure&) {} - return EXIT_FAILURE; - } - - .. tab-item:: LoRA.cpp - :name: loracpp - - .. 
code-block:: cpp - - #include "openvino/genai/text2image/pipeline.hpp" - - #include "imwrite.hpp" - - int32_t main(int32_t argc, char* argv[]) try { - OPENVINO_ASSERT(argc >= 3 && (argc - 3) % 2 == 0, "Usage: ", argv[0], " '' [ ...]]"); - - const std::string models_path = argv[1], prompt = argv[2]; - const std::string device = "CPU"; // GPU, NPU can be used as well - - ov::genai::AdapterConfig adapter_config; - for(size_t i = 0; i < (argc - 3)/2; ++i) { - ov::genai::Adapter adapter(argv[3 + 2*i]); - float alpha = std::atof(argv[3 + 2*i + 1]); - adapter_config.add(adapter, alpha); - } - - ov::genai::Text2ImagePipeline pipe(models_path, device, ov::genai::adapters(adapter_config)); - - std::cout << "Generating image with LoRA adapters applied, resulting image will be in lora.bmp\n"; - ov::Tensor image = pipe.generate(prompt, - ov::genai::random_generator(std::make_shared(42)), - ov::genai::width(512), - ov::genai::height(896), - ov::genai::num_inference_steps(20)); - imwrite("lora.bmp", image, true); - - std::cout << "Generating image without LoRA adapters applied, resulting image will be in baseline.bmp\n"; - image = pipe.generate(prompt, - ov::genai::adapters(), - ov::genai::random_generator(std::make_shared(42)), - ov::genai::width(512), - ov::genai::height(896), - ov::genai::num_inference_steps(20)); - imwrite("baseline.bmp", image, true); - - return EXIT_SUCCESS; - } catch (const std::exception& error) { - try { - std::cerr << error.what() << '\n'; - } catch (const std::ios_base::failure&) {} - return EXIT_FAILURE; - } catch (...) { - try { - std::cerr << "Non-exception object thrown\n"; - } catch (const std::ios_base::failure&) {} - return EXIT_FAILURE; - } - - - For more information, refer to the - `C++ sample `__ - - - - - -Using GenAI in Speech Recognition -################################# - - -The application, shown in code samples below, performs inference on speech -recognition Whisper Models. The samples include the ``WhisperPipeline`` class -and use audio files in WAV format at a sampling rate of 16 kHz as input. - -.. tab-set:: - - .. tab-item:: Python - :sync: cpp - - .. code-block:: python - - import openvino_genai - import librosa - - - def read_wav(filepath): - raw_speech, samplerate = librosa.load(filepath, sr=16000) - return raw_speech.tolist() - - - def infer(model_dir: str, wav_file_path: str): - device = "CPU" # GPU or NPU can be used as well. - pipe = openvino_genai.WhisperPipeline(model_dir, device) - - # The pipeline expects normalized audio with a sampling rate of 16kHz. - raw_speech = read_wav(wav_file_path) - result = pipe.generate( - raw_speech, - max_new_tokens=100, - language="<|en|>", - task="transcribe", - return_timestamps=True, - ) - - print(result) - - for chunk in result.chunks: - print(f"timestamps: [{chunk.start_ts}, {chunk.end_ts}] text: {chunk.text}") - - - For more information, refer to the - `Python sample `__. - - .. tab-item:: C++ - :sync: cpp - - .. code-block:: cpp - - #include "audio_utils.hpp" - #include "openvino/genai/whisper_pipeline.hpp" - - int main(int argc, char* argv[]) try { - if (3 > argc) { - throw std::runtime_error(std::string{"Usage: "} + argv[0] + " \"\""); - } - - std::filesystem::path models_path = argv[1]; - std::string wav_file_path = argv[2]; - std::string device = "CPU"; // GPU or NPU can be used as well. 
- - ov::genai::WhisperPipeline pipeline(models_path, device); - - ov::genai::WhisperGenerationConfig config(models_path / "generation_config.json"); - config.max_new_tokens = 100; - config.language = "<|en|>"; - config.task = "transcribe"; - config.return_timestamps = true; - - // The pipeline expects normalized audio with a sampling rate of 16kHz. - ov::genai::RawSpeechInput raw_speech = utils::audio::read_wav(wav_file_path); - auto result = pipeline.generate(raw_speech, config); - - std::cout << result << "\n"; - - for (auto& chunk : *result.chunks) { - std::cout << "timestamps: [" << chunk.start_ts << ", " << chunk.end_ts << "] text: " << chunk.text << "\n"; - } - - } catch (const std::exception& error) { - try { - std::cerr << error.what() << '\n'; - } catch (const std::ios_base::failure&) { - } - return EXIT_FAILURE; - } catch (...) { - try { - std::cerr << "Non-exception object thrown\n"; - } catch (const std::ios_base::failure&) { - } - return EXIT_FAILURE; - } - - - For more information, refer to the - `C++ sample `__. - - -Using GenAI in Chat Scenario -############################ - -For chat scenarios where inputs and outputs represent a conversation, maintaining KVCache across inputs -may prove beneficial. The ``start_chat`` and ``finish_chat`` chat-specific methods are used to -mark a conversation session, as shown in the samples below: - -.. tab-set:: - - .. tab-item:: Python - :sync: py - - .. code-block:: python - - import openvino_genai - - - def streamer(subword): - print(subword, end='', flush=True) - return False - - - def infer(model_dir: str): - device = 'CPU' # GPU can be used as well. - pipe = openvino_genai.LLMPipeline(model_dir, device) - - config = openvino_genai.GenerationConfig() - config.max_new_tokens = 100 - - pipe.start_chat() - while True: - try: - prompt = input('question:\n') - except EOFError: - break - pipe.generate(prompt, config, streamer) - print('\n----------') - pipe.finish_chat() - - - - For more information, refer to the - `Python sample `__. - - .. tab-item:: C++ - :sync: cpp - - .. code-block:: cpp - - #include "openvino/genai/llm_pipeline.hpp" - - int main(int argc, char* argv[]) try { - if (2 != argc) { - throw std::runtime_error(std::string{"Usage: "} + argv[0] + " "); - } - std::string prompt; - std::string models_path = argv[1]; - - std::string device = "CPU"; // GPU, NPU can be used as well - ov::genai::LLMPipeline pipe(models_path, device); - - ov::genai::GenerationConfig config; - config.max_new_tokens = 100; - std::function streamer = [](std::string word) { - std::cout << word << std::flush; - return false; - }; - - pipe.start_chat(); - std::cout << "question:\n"; - while (std::getline(std::cin, prompt)) { - pipe.generate(prompt, config, streamer); - std::cout << "\n----------\n" - "question:\n"; - } - pipe.finish_chat(); - } catch (const std::exception& error) { - try { - std::cerr << error.what() << '\n'; - } catch (const std::ios_base::failure&) {} - return EXIT_FAILURE; - } catch (...) { - try { - std::cerr << "Non-exception object thrown\n"; - } catch (const std::ios_base::failure&) {} - return EXIT_FAILURE; - } - - - For more information, refer to the - `C++ sample `__ - - -Using GenAI with Vision Language Models -####################################### - -OpenVINO GenAI introduces the ``openvino_genai.VLMPipeline`` pipeline for -inference of multimodal text-generation Vision Language Models (VLMs). -With a text prompt and an image as input, VLMPipeline can generate text using -models such as LLava or MiniCPM-V. 
See the chat scenario presented -in the samples below: - -.. tab-set:: - - .. tab-item:: Python - :sync: py - - .. code-block:: python - - import numpy as np - import openvino_genai - from PIL import Image - from openvino import Tensor - from pathlib import Path - - - def streamer(subword: str) -> bool: - print(subword, end='', flush=True) - - - def read_image(path: str) -> Tensor: - pic = Image.open(path).convert("RGB") - image_data = np.array(pic.getdata()).reshape(1, pic.size[1], pic.size[0], 3).astype(np.uint8) - return Tensor(image_data) - - - def read_images(path: str) -> list[Tensor]: - entry = Path(path) - if entry.is_dir(): - return [read_image(str(file)) for file in sorted(entry.iterdir())] - return [read_image(path)] - - - def infer(model_dir: str, image_dir: str): - rgbs = read_images(image_dir) - device = 'CPU' # GPU can be used as well. - enable_compile_cache = dict() - if "GPU" == device: - enable_compile_cache["CACHE_DIR"] = "vlm_cache" - pipe = openvino_genai.VLMPipeline(model_dir, device, **enable_compile_cache) - - config = openvino_genai.GenerationConfig() - config.max_new_tokens = 100 - - pipe.start_chat() - prompt = input('question:\n') - pipe.generate(prompt, images=rgbs, generation_config=config, streamer=streamer) - - while True: - try: - prompt = input("\n----------\n" - "question:\n") - except EOFError: - break - pipe.generate(prompt, generation_config=config, streamer=streamer) - pipe.finish_chat() - - - For more information, refer to the - `Python sample `__. - - .. tab-item:: C++ - :sync: cpp - - .. code-block:: cpp - - #include "load_image.hpp" - #include - #include - - bool print_subword(std::string&& subword) { - return !(std::cout << subword << std::flush); - } - - int main(int argc, char* argv[]) try { - if (3 != argc) { - throw std::runtime_error(std::string{"Usage "} + argv[0] + " "); - } - - std::vector rgbs = utils::load_images(argv[2]); - - std::string device = "CPU"; // GPU can be used as well. - ov::AnyMap enable_compile_cache; - if ("GPU" == device) { - enable_compile_cache.insert({ov::cache_dir("vlm_cache")}); - } - ov::genai::VLMPipeline pipe(argv[1], device, enable_compile_cache); - - ov::genai::GenerationConfig generation_config; - generation_config.max_new_tokens = 100; - - std::string prompt; - - pipe.start_chat(); - std::cout << "question:\n"; - - std::getline(std::cin, prompt); - pipe.generate(prompt, - ov::genai::images(rgbs), - ov::genai::generation_config(generation_config), - ov::genai::streamer(print_subword)); - std::cout << "\n----------\n" - "question:\n"; - while (std::getline(std::cin, prompt)) { - pipe.generate(prompt, - ov::genai::generation_config(generation_config), - ov::genai::streamer(print_subword)); - std::cout << "\n----------\n" - "question:\n"; - } - pipe.finish_chat(); - } catch (const std::exception& error) { - try { - std::cerr << error.what() << '\n'; - } catch (const std::ios_base::failure&) {} - return EXIT_FAILURE; - } catch (...) 
{ - try { - std::cerr << "Non-exception object thrown\n"; - } catch (const std::ios_base::failure&) {} - return EXIT_FAILURE; - } - - - For more information, refer to the - `C++ sample `__ - -Additional Resources -##################### - -* :doc:`Install OpenVINO GenAI <../../../get-started/install-openvino/install-openvino-genai>` -* `OpenVINO GenAI Repo `__ -* `OpenVINO GenAI Samples `__ -* A Jupyter notebook demonstrating - `Visual-language assistant with MiniCPM-V2 and OpenVINO `__ -* `OpenVINO Tokenizers `__ diff --git a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst index 7bf2107482bd3a..4fec1acd23e6a7 100644 --- a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst +++ b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-hf.rst @@ -1,4 +1,4 @@ -Run LLMs with Hugging Face and Optimum Intel +Inference with Optimum Intel =============================================================================================== .. meta:: diff --git a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-native-ov.rst b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-native-ov.rst index 2476a0423e30e1..d33ae05f68f462 100644 --- a/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-native-ov.rst +++ b/docs/articles_en/learn-openvino/llm_inference_guide/llm-inference-native-ov.rst @@ -1,4 +1,4 @@ -Run LLM Inference on Native OpenVINO (not recommended) +Generative AI with Base OpenVINO (not recommended) =============================================================================================== To run Generative AI models using native OpenVINO APIs you need to follow regular diff --git a/docs/sphinx_setup/_static/download/GenAI_Quick_Start_Guide.pdf b/docs/sphinx_setup/_static/download/GenAI_Quick_Start_Guide.pdf new file mode 100644 index 00000000000000..5b6178d85c504b Binary files /dev/null and b/docs/sphinx_setup/_static/download/GenAI_Quick_Start_Guide.pdf differ diff --git a/docs/sphinx_setup/index.rst b/docs/sphinx_setup/index.rst index 2e6f960468015f..4da0aa8f29535c 100644 --- a/docs/sphinx_setup/index.rst +++ b/docs/sphinx_setup/index.rst @@ -11,8 +11,8 @@ generative AI, video, audio, and language with models from popular frameworks li TensorFlow, ONNX, and more. Convert and optimize models, and deploy across a mix of Intel® hardware and environments, on-premises and on-device, in the browser or in the cloud. -Check out the `OpenVINO Cheat Sheet. `__ - +Check out the `OpenVINO Cheat Sheet [PDF] `__ +Check out the `GenAI Quick-start Guide [PDF] `__ .. container::