Please bear with me if my question does not make sense, but I am still learning.
I see that after I give it an input consisting of two modalities (text and image), it returns two different embeddings.
```python
# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(embeddings[ModalityType.VISION])
print(embeddings[ModalityType.TEXT])
```
But I couldn't find anything about getting a single embedding for both of them.
As far as I know, there is no joint multimodal embedding.
You get one embedding per modality and compare them (text · image) to see whether they match; that is the idea of ImageBind and CLIP in short.
If you insist on getting a single one, a naive way is to add them or take their average, but I don't find that very appealing.
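Building on the snippet above, here is a minimal sketch of both options. The softmax comparison mirrors the demo in the ImageBind README; the averaged "joint" vector is just the naive fusion mentioned here, not an official ImageBind feature.

```python
import torch
import torch.nn.functional as F

# `embeddings` and `ModalityType` come from the snippet above
text_emb = embeddings[ModalityType.TEXT]      # [num_texts, embed_dim]
vision_emb = embeddings[ModalityType.VISION]  # [num_images, embed_dim]

# CLIP-style matching: similarity matrix between images and texts,
# softmaxed over the text axis (as in the ImageBind README demo)
match_probs = torch.softmax(vision_emb @ text_emb.T, dim=-1)
print(match_probs)  # row i = how well image i matches each text

# Naive single "multimodal" embedding for paired image/text inputs:
# average the L2-normalized per-modality vectors
joint_emb = F.normalize(
    F.normalize(text_emb, dim=-1) + F.normalize(vision_emb, dim=-1),
    dim=-1,
)
```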
Thank you for the great model.
I wonder how I can get the multimodal embedding of different inputs, like an image and its caption, using ImageBind?
And if I can get that, how can it be compared to CLIP?
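For comparison, this is what the analogous per-modality encoding looks like with CLIP (a minimal sketch using the openai/clip package; the image path and caption are placeholders). Like ImageBind, CLIP gives one embedding per modality rather than a fused one, but it only covers images and text.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs for illustration
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a caption describing the image"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Same kind of cross-modal comparison as with ImageBind above
probs = torch.softmax(image_features @ text_features.T, dim=-1)
```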