[Draft] Add Multimodal RAG notebook (#2497)
![image](https://github.com/user-attachments/assets/a8ebf3fc-7a34-416b-b744-609965792744)
![image](https://github.com/user-attachments/assets/b6f97e32-d567-4278-afac-a633776b463d)

Co-authored-by: Ekaterina Aidova <[email protected]>
1 parent 4c25688 · commit 41280d6 · 6 changed files with 1,257 additions and 1 deletion
@@ -0,0 +1,27 @@
# Multimodal RAG for video analytics with LlamaIndex

Constructing a RAG pipeline for text is relatively straightforward, thanks to the tools developed for parsing, indexing, and retrieving text data. However, adapting RAG models for video content presents a greater challenge. Videos combine visual, auditory, and textual elements, requiring more processing power and sophisticated video pipelines.

To build a truly multimodal search for videos, you need to work with the different modalities of a video, such as its spoken content and visual frames. In this notebook, we showcase a Multimodal RAG pipeline designed for video analytics. It uses the Whisper model to convert spoken content to text, the CLIP model to generate multimodal embeddings, and a Vision Language Model (VLM) to process the retrieved images and text messages. The following picture illustrates how this pipeline works.

![image](https://github.com/user-attachments/assets/a8ebf3fc-7a34-416b-b744-609965792744)
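
The notebook implements these stages with OpenVINO-optimized models. As a rough sketch of the idea only, here is what the two ingestion stages look like with plain Hugging Face `transformers`; the model ids and the `audio.wav`/`frame_0001.png` inputs are placeholders, not the notebook's exact choices:

```python
# Rough sketch of the ingestion stages using plain Hugging Face models;
# the notebook runs OpenVINO-optimized variants of the same idea.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, pipeline

# 1. Speech-to-text: transcribe the video's audio track with Whisper.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
transcript = asr("audio.wav")["text"]  # placeholder extracted audio track

# 2. Multimodal embeddings: CLIP maps frames and text into a shared
#    vector space, so a text query can retrieve matching frames.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.open("frame_0001.png")]  # placeholder extracted frame
inputs = processor(
    text=["a slide about the Gaussian function"],
    images=frames,
    return_tensors="pt",
    padding=True,
    truncation=True,
)
outputs = clip(**inputs)
text_vecs, image_vecs = outputs.text_embeds, outputs.image_embeds  # (1, 512) each
```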
## Notebook contents
The tutorial consists of the following steps:

- Install requirements
- Convert and Optimize model
- Download and process video
- Create the multi-modal index (see the sketch after this list)
- Search text and image embeddings
- Generate final response using VLM
- Launch Interactive demo
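
The indexing and retrieval steps follow LlamaIndex's multimodal pattern. Below is a condensed sketch, assuming Qdrant as the vector store (as in LlamaIndex's multimodal examples) and a `./mixed_data` folder holding the extracted frames and transcript; both names are placeholders:

```python
# Condensed sketch of index creation and retrieval; the notebook plugs
# its OpenVINO-backed CLIP embeddings into this same pattern.
import qdrant_client
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Separate collections for text chunks and image frames.
client = qdrant_client.QdrantClient(location=":memory:")
text_store = QdrantVectorStore(client=client, collection_name="text_collection")
image_store = QdrantVectorStore(client=client, collection_name="image_collection")
storage_context = StorageContext.from_defaults(vector_store=text_store, image_store=image_store)

# Index the extracted frames and transcript together.
documents = SimpleDirectoryReader("./mixed_data").load_data()
index = MultiModalVectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Retrieve the top text chunks and frames for a query.
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)
results = retriever.retrieve("What is a Gaussian function?")
```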

In this demonstration, you'll create an interactive Q&A system that can answer questions about the provided video's content.

## Installation instructions
This is a self-contained example that relies solely on its own code.<br/>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](../../README.md).

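As a rough illustration of the "Install requirements" step, OpenVINO notebooks typically install dependencies with pip magics in the first cell; the package list below is illustrative rather than the notebook's pinned requirements:

```python
# Illustrative dependency install (run in the notebook's first cell);
# see the notebook itself for the exact packages and pinned versions.
%pip install -q openvino llama-index llama-index-vector-stores-qdrant transformers gradio
```
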
<img referrerpolicy="no-referrer-when-downgrade" src="https://static.scarf.sh/a.png?x-pxid=5b5a4db0-7875-4bfb-bdbd-01698b5b1a77&file=notebooks/multimodal-rag/README.md" />
@@ -0,0 +1,129 @@
"""Gradio UI helper for the Multimodal RAG video Q&A demo."""
from typing import Callable

import gradio as gr

examples = [
    ["Tell me more about gaussian function"],
    ["Explain the formula of gaussian function to me"],
    ["What is the Herschel Maxwell derivation of a Gaussian ?"],
]


def clear_files():
    return "Vector Store is Not ready"


def handle_user_message(message, history):
    """
    Callback for recording the user's message on submit button click.
    Params:
      message: current message
      history: conversation history
    Returns:
      cleared message box and updated conversation history
    """
    # Append the user's message to the conversation history
    return "", history + [[message, ""]]


def make_demo(
    example_path: str,
    build_index: Callable,
    search: Callable,
    run_fn: Callable,
    stop_fn: Callable,
):
    with gr.Blocks(
        theme=gr.themes.Soft(),
        css=".disclaimer {font-variant-caps: all-small-caps;}",
    ) as demo:
        gr.Markdown("""<h1><center>QA over Video</center></h1>""")
        gr.Markdown("""<center>Powered by OpenVINO</center>""")
        image_list = gr.State([])
        txt_list = gr.State([])

        with gr.Row():
            with gr.Column(scale=1):
                video_file = gr.Video(
                    label="Step 1: Load a '.mp4' video file",
                    value=example_path,
                )
                load_video = gr.Button("Step 2: Build Vector Store", variant="primary")
                status = gr.Textbox(
                    "Vector Store is Ready",
                    show_label=False,
                    max_lines=1,
                    interactive=False,
                )

            with gr.Column(scale=3):
                chatbot = gr.Chatbot(
                    height=800,
                    label="Step 3: Input Query",
                )
                with gr.Row():
                    with gr.Column():
                        with gr.Row():
                            msg = gr.Textbox(
                                label="QA Message Box",
                                placeholder="Chat Message Box",
                                show_label=False,
                                container=False,
                            )
                    with gr.Column():
                        with gr.Row():
                            submit = gr.Button("Submit", variant="primary")
                            stop = gr.Button("Stop")
                            clear = gr.Button("Clear")
                gr.Examples(
                    examples,
                    inputs=msg,
                    label="Click on any example and press the 'Submit' button",
                )
        # Removing the video invalidates the index; loading one rebuilds it
        # and re-enables the Submit button once the vector store is ready.
        video_file.clear(clear_files, outputs=[status], queue=False).then(lambda: gr.Button(interactive=False), outputs=submit)
        load_video.click(lambda: gr.Button(interactive=False), outputs=submit).then(
            fn=build_index,
            inputs=[video_file],
            outputs=[status],
            queue=True,
        ).then(lambda: gr.Button(interactive=True), outputs=submit)
        # Query wiring: record the user turn, retrieve matching frames and
        # transcript chunks, then generate the VLM answer into the chatbot.
        submit_event = (
            msg.submit(handle_user_message, [msg, chatbot], [msg, chatbot], queue=False)
            .then(
                search,
                [chatbot],
                [image_list, txt_list],
                queue=True,
            )
            .then(
                run_fn,
                [chatbot, image_list, txt_list],
                chatbot,
                queue=True,
            )
        )
        submit_click_event = (
            submit.click(handle_user_message, [msg, chatbot], [msg, chatbot], queue=False)
            .then(
                search,
                [chatbot],
                [image_list, txt_list],
                queue=True,
            )
            .then(
                run_fn,
                [chatbot, image_list, txt_list],
                chatbot,
                queue=True,
            )
        )
        stop.click(
            fn=stop_fn,
            inputs=None,
            outputs=None,
            cancels=[submit_event, submit_click_event],
            queue=False,
        )
        clear.click(lambda: None, None, chatbot, queue=False)
    return demo
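

# Hypothetical wiring of make_demo: the notebook passes real callbacks
# backed by the OpenVINO models; the stubs below only show the expected
# signatures, and "gaussian.mp4" is a placeholder path.
def build_index(video_path):
    return "Vector Store is Ready"  # status string shown in the textbox


def search(history):
    return [], []  # retrieved frames and transcript chunks for the last query


def run(history, images, texts):
    history[-1][1] = "(generated answer)"  # VLM response for the last turn
    return history


def stop():
    pass


demo = make_demo("gaussian.mp4", build_index, search, run, stop)
demo.queue().launch()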