How to calculate the image_text_similarity scores for both Chinese and English? #473

Open
weiaicunzai opened this issue Nov 5, 2024 · 3 comments
Labels: dj:multimodal (issues/PRs about multimodal data processing), dj:op (issues/PRs about some specific OPs), question (Further information is requested)

Comments

@weiaicunzai

weiaicunzai commented Nov 5, 2024

Thank you for your excellent work.

My dataset includes both English and Chinese samples. How can I calculate the similarity scores between image and text pairs for both languages at the same time?

weiaicunzai added the question label Nov 5, 2024
@HYLcool
Collaborator

HYLcool commented Nov 14, 2024

Hi @weiaicunzai, thanks for your interest in Data-Juicer~

We use CLIP as the default model to compute the embeddings of image-text pairs. It works well on English text but not on Chinese text (see openai/CLIP#7). For Chinese text, models like Chinese-CLIP might perform better.
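For readers unfamiliar with how such a score is derived: CLIP-style models embed the image and the text into the same vector space, and the similarity is commonly taken as the cosine of the angle between the two embeddings. The sketch below illustrates only that computation. The vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is a hypothetical helper, not Data-Juicer's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy embeddings (assumed values, for illustration only).
image_emb = [0.2, 0.7, 0.1]
matching_text_emb = [0.25, 0.65, 0.05]
unrelated_text_emb = [0.9, -0.1, 0.3]

sim_match = cosine_similarity(image_emb, matching_text_emb)
sim_unrelated = cosine_similarity(image_emb, unrelated_text_emb)
print(f"matching: {sim_match:.3f}, unrelated: {sim_unrelated:.3f}")
```

A text encoder that handles a language poorly produces embeddings that do not line up with the image embeddings, so the scores stop being informative for that language.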

One possible approach is to split the dataset into English and Chinese subsets with our dedicated dataset_split_by_language tool, and then configure the image_text_similarity_filter OP with a different model for each subset.
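The split-then-route idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `guess_language` is a crude hypothetical heuristic (any CJK ideograph means Chinese) and is not the logic of the dataset_split_by_language tool; the sample field names are made up; and the model identifiers in `MODEL_BY_LANG` are example Hugging Face model names whose compatibility with the image_text_similarity_filter OP should be checked before use.

```python
import re
from collections import defaultdict

# Matches characters in the basic CJK Unified Ideographs block.
_CJK_RE = re.compile(r"[\u4e00-\u9fff]")

# Assumed model choice per language (example identifiers, not verified
# against the OP).
MODEL_BY_LANG = {
    "en": "openai/clip-vit-base-patch32",
    "zh": "OFA-Sys/chinese-clip-vit-base-patch16",
}


def guess_language(text):
    """Hypothetical heuristic: 'zh' if any CJK ideograph is present, else 'en'."""
    return "zh" if _CJK_RE.search(text) else "en"


def split_by_language(samples, text_key="text"):
    """Group samples into per-language subsets keyed by language code."""
    subsets = defaultdict(list)
    for sample in samples:
        subsets[guess_language(sample[text_key])].append(sample)
    return dict(subsets)


samples = [
    {"text": "a dog running on the beach", "image": "dog.jpg"},
    {"text": "一只在沙滩上奔跑的狗", "image": "dog_zh.jpg"},
    {"text": "a red bicycle", "image": "bike.jpg"},
]

subsets = split_by_language(samples)
for lang, subset in subsets.items():
    # Each subset would then be filtered in its own run with its own model.
    print(lang, MODEL_BY_LANG[lang], len(subset))
```

Each subset is then processed in a separate run, so scores from the two models are never mixed. Since the models are different, a threshold tuned on one subset may not transfer to the other.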

HYLcool added the dj:multimodal and dj:op labels Nov 14, 2024
@weiaicunzai
Author

Thanks. Is there a Chinese BLIP model that can be used in the image_text_matching_filter OP, similar to Salesforce/blip-itm-base-coco?

@HYLcool
Collaborator

HYLcool commented Dec 30, 2024

Maybe you can look for one on the Hugging Face Hub.
