
Get the multimodal embeddings #117

Open
hessaAlawwad opened this issue May 21, 2024 · 2 comments

Comments


hessaAlawwad commented May 21, 2024

Thank you for the great model.

I wonder how I can get the multimodal embedding of different inputs, such as an image and its caption, using ImageBind?

If I can get that, how would it compare to CLIP?

hessaAlawwad (Author) commented

Please bear with me if my questions do not make sense; I am still learning.
I see that after I give it an input consisting of two modalities (text and image), it returns two different embeddings:

import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Load data (text_list: caption strings, image_paths: image file paths)
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(embeddings[ModalityType.VISION])
print(embeddings[ModalityType.TEXT])

But I couldn't find anything about getting a single embedding for both.


lixinghe1999 commented Jun 25, 2024

As far as I know, there is no fused multimodal embedding.
You get one embedding per modality and compare them (Text * Image) to see whether they match; that, in short, is the idea behind both ImageBind and CLIP.
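
As a rough sketch, assuming the embeddings dict from your snippet above, the comparison looks roughly like the example in the repository README:

# Cross-modal comparison: each row gives the softmax-normalized
# similarity of one image embedding to every text embedding.
vision_x_text = torch.softmax(
    embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1
)
print("Vision x Text:", vision_x_text)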

If you insist on getting a single one, a naive way is to add the two embeddings or take their average, though I believe the result is not very appealing.
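
If you do want one vector per (image, caption) pair, a minimal sketch of that naive fusion could look like the following; note this is only a heuristic assumption, not something ImageBind itself defines:

import torch.nn.functional as F

# Naive fusion (a heuristic, not part of ImageBind): average the
# L2-normalized per-modality embeddings into one vector per pair.
vision_emb = F.normalize(embeddings[ModalityType.VISION], dim=-1)
text_emb = F.normalize(embeddings[ModalityType.TEXT], dim=-1)
fused = F.normalize((vision_emb + text_emb) / 2, dim=-1)
print(fused.shape)  # one fused vector per (image, caption) pair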
