InternVL家族：通过开源组件缩小与商业多模态模型的差距 —— GPT-4o的开源替代方案

使用文档

🌟 Get Started

Installation: 🌱 Installation Guide | 📄 requirements.txt
Chat Data Format: 📝 Meta File | ✏️ Text | 🖼️ Single-Image | 🖼️🖼️ Multi-Image | 🎥 Video
Local Chat Demo: 🤖 Streamlit Demo
InternVL-Chat API: 🌐 InternVL2-Pro
Tutorials: 🚀 Enhancing InternVL2 on COCO Caption Using LoRA Fine-Tuning

🏆 InternVL Family

InternVL 2.5: 📖 Intro | ⚡ Quick Start | ✨ Finetune | 📊 Evaluate | 📦 Deploy | 🎯 MPO
InternVL 2.0: 📖 Intro | ⚡ Quick Start | ✨ Finetune | 📊 Evaluate | 📦 Deploy | 🎯 MPO
InternVL 1.5: 📖 Intro | ⚡ Quick Start | ✨ Finetune | 📊 Evaluate | 📦 Deploy
InternVL 1.2: 📖 Intro | ⚡ Quick Start | ✨ Finetune | 📊 Evaluate
InternVL 1.1: 📖 Intro | ⚡ Quick Start | 📊 Evaluation
InternVL 1.0: 🖼️ Classification | 📊 CLIP-Benchmark | 🎨 Segmentation | 💬 Chat-LLaVA | ✨ InternVL-G

模型库

多模态大语言模型 (InternVL 2.5)

Model Name	Vision Part	Language Part	HF Link	MS Link
InternVL2_5-1B	InternViT‑300M‑448px‑V2_5	Qwen2.5‑0.5B‑Instruct	🤗 link	🤖 link
InternVL2_5-2B	InternViT-300M-448px-V2_5	internlm2_5-1_8b-chat	🤗 link	🤖 link
InternVL2_5-4B	InternViT-300M-448px-V2_5	Qwen2.5-3B-Instruct	🤗 link	🤖 link
InternVL2_5-8B	InternViT-300M-448px-V2_5	internlm2_5-7b-chat	🤗 link	🤖 link
InternVL2_5-26B	InternViT-6B-448px-V2_5	internlm2_5-20b-chat	🤗 link	🤖 link
InternVL2_5-38B	InternViT-6B-448px-V2_5	Qwen2.5-32B-Instruct	🤗 link	🤖 link
InternVL2_5-78B	InternViT-6B-448px-V2_5	Qwen2.5-72B-Instruct	🤗 link	🤖 link

Model Name	Vision Part	Language Part	HF Link	MS Link
InternVL2_5-1B-MPO	InternViT‑300M‑448px‑V2_5	Qwen2.5‑0.5B‑Instruct	🤗 link	🤖 link
InternVL2_5-2B-MPO	InternViT-300M-448px-V2_5	internlm2_5-1_8b-chat	🤗 link	🤖 link
InternVL2_5-4B-MPO	InternViT-300M-448px-V2_5	Qwen2.5-3B-Instruct	🤗 link	🤖 link
InternVL2_5-8B-MPO	InternViT-300M-448px-V2_5	internlm2_5-7b-chat	🤗 link	🤖 link
InternVL2_5-26B-MPO	InternViT-6B-448px-V2_5	internlm2_5-20b-chat	🤗 link	🤖 link
InternVL2_5-38B-MPO	InternViT-6B-448px-V2_5	Qwen2.5-32B-Instruct	🤗 link	🤖 link
InternVL2_5-78B-MPO	InternViT-6B-448px-V2_5	Qwen2.5-72B-Instruct	🤗 link	🤖 link

多模态大语言模型 (InternVL 2.0)

Model Name	Vision Part	Language Part	HF Link	MS Link
InternVL2-1B	InternViT-300M-448px	Qwen2-0.5B-Instruct	🤗 link	🤖 link
InternVL2-2B	InternViT-300M-448px	internlm2-chat-1-8b	🤗 link	🤖 link
InternVL2-4B	InternViT-300M-448px	Phi‑3‑mini‑128k‑instruct	🤗 link	🤖 link
InternVL2-8B	InternViT-300M-448px	internlm2_5-7b-chat	🤗 link	🤖 link
InternVL2-26B	InternViT-6B-448px-V1-5	internlm2-chat-20b	🤗 link	🤖 link
InternVL2-40B	InternViT‑6B‑448px‑V1‑5	Nous‑Hermes‑2‑Yi‑34B	🤗 link	🤖 link
InternVL2‑Llama3-76B	InternViT-6B-448px-V1-5	Hermes‑2‑Theta‑ Llama‑3‑70B	🤗 link	🤖 link

多模态大语言模型 (InternVL 1.0-1.5)

Model	Date	HF Link	MS Link	Note
Mini‑InternVL‑Chat‑4B‑V1‑5	2024.05.28	🤗 link	🤖 link	🚀🚀 16% 的模型大小, 90% 的性能
Mini-InternVL-Chat-2B-V1-5	2024.05.19	🤗 link	🤖 link	🚀 8% 的模型大小, 80% 的性能
InternVL-Chat-V1-5	2024.04.18	🤗 link	🤖 link	支持 4K 图像；超强的 OCR 能力；在 MMMU、DocVQA、ChartQA、MathVista 等各种基准测试中，性能接近 GPT-4V 和 Gemini Pro
InternVL-Chat-V1-2-Plus	2024.02.21	🤗 link	🤖 link	更多的 SFT 数据和更强的性能
InternVL-Chat-V1-2	2024.02.11	🤗 link	🤖 link	将 LLM 扩展到 34B
InternVL-Chat-V1-1	2024.01.24	🤗 link	🤖 link	支持中文和更强的 OCR 能力
InternVL-Chat-19B	2023.12.25	🤗 link	🤖 link	英语多模态对话
InternVL-Chat-13B	2023.12.25	🤗 link	🤖 link	英语多模态对话

类 CLIP 模型 (InternVL 1.0-2.5)

Model	Date	HF Link	MS Link	Note
InternViT-300M-448px-V2_5	2024.12.05	🤗 link	🤖 link	🚀🚀 一个更强的轻量视觉编码器 (🔥新)
InternViT-6B-448px-V2_5	2024.12.05	🤗 link	🤖 link	🚀🚀 拥有更强的视觉特征提取能力 (🔥新)
InternViT-300M-448px	2024.05.25	🤗 link	🤖 link	蒸馏的小型视觉基础模型，具有 300M 参数
InternViT‑6B‑448px‑V1‑5	2024.04.20	🤗 link	🤖 link	通过增量预训练支持动态分辨率和超强的 OCR 特征提取能力
InternViT-6B-448px-V1-2	2024.02.11	🤗 link	🤖 link	通过增量预训练支持 448 分辨率
InternViT-6B-448px-V1-0	2024.01.30	🤗 link	🤖 link	通过增量预训练支持 448 分辨率
InternViT-6B-224px	2023.12.22	🤗 link	🤖 link	InternViT-6B 的第一个版本，提取自 InternVL‑14B‑224px

视觉语言基础模型 (InternVL 1.0)

Model	Date	HF Link	MS Link	Note
InternVL‑14B‑224px	2023.12.22	🤗 link	🤖 link	视觉-语言基础模型，InternViT-6B + QLLaMA，可以用于类似 CLIP 的图文检索

TODO 列表

发布 InternVL2.5 系列的训练 / 评估代码
支持 liger kernels 以节省显存
发布 MPO 的代码、模型和数据
支持多模态 packed dataset
支持 vLLM 和 Ollama
在 Demo 中支持视频和 PDF 输入
发布集成 VisionLLMv2 的 InternVL2
使用 readthedocs 重新构建文档
支持使用 LoRA 微调不同的 LLMs
发布 InternVL2 的 requirements.txt
发布 InternVL2 系列的训练 / 评估代码
发布 InternVL1.5 和 InternVL2 的 Streamlit 网页 UI

InternVL 可以做什么?

视觉感知 (点击展开)

线性探针图像分类 [查看详情]

ViT-22B uses the private JFT-3B dataset.

method	#param	IN-1K	IN-ReaL	IN-V2	IN-A	IN-R	IN-Sketch
OpenCLIP-G	1.8B	86.2	89.4	77.2	63.8	87.8	66.4
DINOv2-g	1.1B	86.5	89.6	78.4	75.9	78.8	62.5
EVA-01-CLIP-g	1.1B	86.5	89.3	77.4	70.5	87.7	63.1
MAWS-ViT-6.5B	6.5B	87.8	-	-	-	-	-
ViT-22B*	21.7B	89.5	90.9	83.2	83.8	87.4	-
InternViT-6B (ours)	5.9B	88.2	90.4	79.9	77.5	89.8	69.1

语义分割 [查看详情]

method	decoder	#param (train/total)	crop size	mIoU
OpenCLIP-G (frozen)	Linear	0.3M / 1.8B	512	39.3
ViT-22B (frozen)	Linear	0.9M / 21.7B	504	34.6
InternViT-6B (frozen)	Linear	0.5M / 5.9B	504	47.2 (+12.6)
ViT-22B (frozen)	UperNet	0.8B / 22.5B	504	52.7
InternViT-6B (frozen)	UperNet	0.4B / 6.3B	504	54.9 (+2.2)
ViT-22B	UperNet	22.5B / 22.5B	504	55.3
InternViT-6B	UperNet	6.3B / 6.3B	504	58.9 (+3.6)

零样本图像分类 [查看详情]

method	IN-1K	IN-A	IN-R	IN-V2	IN-Sketch	ObjectNet
OpenCLIP-G	80.1	69.3	92.1	73.6	68.9	73.0
EVA-02-CLIP-E+	82.0	82.1	94.5	75.7	71.6	79.6
ViT-22B*	85.9	90.1	96.0	80.9	-	87.6
InternVL-C (ours)	83.2	83.8	95.5	77.3	73.9	80.6

多语言零样本图像分类 [查看详情]

EN: English, ZH: Chinese, JP: Japanese, Ar: Arabic, IT: Italian

method	IN-1K (EN)	IN-1K (ZH)	IN-1K (JP)	IN-1K (AR)	IN-1K (IT)
Taiyi-CLIP-ViT-H	-	54.4	-	-	-
WuKong-ViT-L-G	-	57.5	-	-	-
CN-CLIP-ViT-H	-	59.6	-	-	-
AltCLIP-ViT-L	74.5	59.6	-	-	-
EVA-02-CLIP-E+	82.0	-	-	-	41.2
OpenCLIP-XLM-R-H	77.0	55.7	53.1	37.0	56.8
InternVL-C (ours)	83.2	64.5	61.5	44.9	65.7

零样本视频分类

method #frame K400 K600 K700

OpenCLIP-G 1 65.9 66.1 59.2

EVA-02-CLIP-E+ 1 69.8 69.3 63.4

InternVL-C (ours) 1 71.0 71.3 65.7

ViCLIP 8 75.7 73.5 66.4

InternVL-C (ours) 8 79.4 78.8 71.5

跨模态检索 (点击展开)

英语零样本图文检索 [查看详情]

model	Flickr30K						COCO						avg
	image-to-text			text-to-image			image-to-text			text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
OpenCLIP-G	92.9	99.3	99.8	79.5	95.0	97.1	67.3	86.9	92.6	51.4	74.9	83.0	85.0
EVA-02-CLIP-E+	93.9	99.4	99.8	78.8	94.2	96.8	68.8	87.8	92.8	51.1	75.0	82.7	85.1
EVA-CLIP-8B	95.6	99.6	99.9	80.8	95.5	97.6	70.3	89.3	93.9	53.0	76.0	83.4	86.2
InternVL-C (ours)	94.7	99.6	99.9	81.7	96.0	98.2	70.6	89.0	93.5	54.1	77.3	84.6	86.6
InternVL-G (ours)	95.7	99.7	99.9	85.0	97.0	98.6	74.9	91.3	95.2	58.6	81.3	88.0	88.8

中文零样本图文检索 [查看详情]

model	Flickr30K-CN						COCO-CN						avg
	image-to-text			text-to-image			image-to-text			text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
CN-CLIP-ViT-H	81.6	97.5	98.8	71.2	91.4	95.5	63.0	86.6	92.9	69.2	89.9	96.1	86.1
OpenCLIP-XLM-R-H	86.1	97.5	99.2	71.0	90.5	94.9	70.0	91.5	97.0	66.1	90.8	96.0	87.6
InternVL-C (ours)	90.3	98.8	99.7	75.1	92.9	96.4	68.8	92.0	96.7	68.9	91.9	96.5	89.0
InternVL-G (ours)	92.9	99.4	99.8	77.7	94.8	97.3	71.4	93.9	97.7	73.8	94.4	98.1	90.9

多语言零样本图文对检索 [查看详情]

method	EN	ES	FR	ZH	IT	KO	RU	JP	average
AltCLIP	95.4	94.1	92.9	95.1	94.2	94.4	91.8	91.7	93.7
OpenCLIP-XLM-R-H	97.3	96.1	94.5	94.7	96.0	90.2	93.9	94.0	94.6
InternVL-C (ours)	97.3	95.7	95.1	95.6	96.0	92.2	93.3	95.5	95.1
InternVL-G (ours)	98.6	97.7	96.5	96.7	96.9	95.1	94.8	96.1	96.6

多模态对话

使用 HuggingFace 快速开始

使用 InternViT-6B 提取视觉特征 (点击展开)

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-448px-V2_5',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image = Image.open('./examples/image1.jpg').convert('RGB')

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-448px-V1-5')

pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

outputs = model(pixel_values)

使用 InternVL-C(ontrastive) 和 InternVL-G(enerative) 进行跨模态检索 (点击展开)

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer


model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0  # set pad_token_id to 0

images = [
    Image.open('./examples/image1.jpg').convert('RGB'),
    Image.open('./examples/image2.jpg').convert('RGB'),
    Image.open('./examples/image3.jpg').convert('RGB')
]
prefix = 'summarize:'
texts = [
    prefix + 'a photo of a red panda',  # English
    prefix + '一张熊猫的照片',  # Chinese
    prefix + '二匹の猫の写真'  # Japanese
]

pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# InternVL-C
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
#         [2.2949e-02, 9.7656e-01, 5.9903e-06],
#         [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# InternVL-G
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-G')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
#         [8.6060e-03, 9.9219e-01, 2.8759e-06],
#         [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# please set add_eos_token to False for generation
tokenizer.add_eos_token = False
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

tokenized = tokenizer("English caption:", return_tensors='pt')
pred = model.generate(
    pixel_values=pixel_values,
    input_ids=tokenized.input_ids.cuda(),
    attention_mask=tokenized.attention_mask.cuda(),
    num_beams=5,
    min_new_tokens=8,
)
caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
# English caption: a red panda sitting on top of a wooden platform

使用 InternVL 2.5 进行多模态对话 (点击展开)

这里我们以较小的 OpenGVLab/InternVL2_5-8B 为例：

import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
# Otherwise, you need to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
path = 'OpenGVLab/InternVL2_5-8B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False)

# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation (单图多轮对话)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images (多图多轮对话，拼接图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images (多图多轮对话，独立图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation (视频多轮对话)
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame-{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

许可证

本项目以 MIT 许可证发布。项目中的部分代码和模型来自其它来源，受其原始许可证的约束。

引用

如果您在研究中发现本项目有用，请考虑引用：

@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
@article{gao2024mini,
  title={Mini-InternVL: a flexible-transfer pocket multi-modal model with 5\% parameters and 90\% performance},
  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
  journal={Visual Intelligence},
  volume={2},
  number={1},
  pages={1--17},
  year={2024},
  publisher={Springer}
}
@article{chen2024far,
  title={How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={Science China Information Sciences},
  volume={67},
  number={12},
  pages={220101},
  year={2024},
  publisher={Springer}
}
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}

致谢

InternVL 的代码构建参考了以下的项目: OpenAI CLIP、Open CLIP、CLIP Benchmark、EVA、InternImage、ViT-Adapter、MMSegmentation、Transformers、DINOv2、BLIP-2、Qwen-VL和 LLaVA-1.5，感谢这些杰出的工作。

扫描下方二维码，加入我们的项目微信群。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README_zh.md

README_zh.md

InternVL家族：通过开源组件缩小与商业多模态模型的差距 —— GPT-4o的开源替代方案

最新消息 🚀🚀🚀

使用文档

🌟 Get Started

🏆 InternVL Family

模型库

多模态大语言模型 (InternVL 2.5)

多模态大语言模型 (InternVL 2.0)

多模态大语言模型 (InternVL 1.0-1.5)

类 CLIP 模型 (InternVL 1.0-2.5)

视觉语言基础模型 (InternVL 1.0)

TODO 列表

InternVL 可以做什么?

使用 HuggingFace 快速开始

许可证

引用

致谢

method	#frame	K400	K600	K700
OpenCLIP-G	1	65.9	66.1	59.2
EVA-02-CLIP-E+	1	69.8	69.3	63.4
InternVL-C (ours)	1	71.0	71.3	65.7
ViCLIP	8	75.7	73.5	66.4
InternVL-C (ours)	8	79.4	78.8	71.5

Files

README_zh.md

Latest commit

History

README_zh.md

File metadata and controls

InternVL家族：通过开源组件缩小与商业多模态模型的差距 —— GPT-4o的开源替代方案

最新消息 🚀🚀🚀

使用文档

🌟 Get Started

🏆 InternVL Family

模型库

多模态大语言模型 (InternVL 2.5)

多模态大语言模型 (InternVL 2.0)

多模态大语言模型 (InternVL 1.0-1.5)

类 CLIP 模型 (InternVL 1.0-2.5)

视觉语言基础模型 (InternVL 1.0)

TODO 列表

InternVL 可以做什么?

使用 HuggingFace 快速开始

许可证

引用

致谢