chore: analyze language identification models performance on short ingredient texts with precision-recall evaluation (#349) #365
Conversation
I think it would be great to have a separate file (like 04_inference.py) dedicated to inference; then it would be easy to wrap the code and deploy it in the future.
@baslia Could you please help me with this error? https://github.com/openfoodfacts/openfoodfacts-ai/actions/runs/12106320879/job/33751933052?pr=365
Hey, I thought this was just a check on the PR label attribute; I have attached the label "enhancement".
Yes, indeed, it's a configuration issue in the repo; it's safe to ignore!
df = pd.DataFrame(data, columns=["ingredients_text", "lang"])
df.dropna(inplace=True)
df.to_csv(dataset_file, index=False)
I had an issue when generating the CSV:
_csv.Error: need to escape, but no escapechar set
By providing an escape char, it works:
df.to_csv(dataset_file, index=False, escapechar="\\")
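For context, a minimal sketch of how this error can arise and how the escapechar suggestion resolves it. This assumes the failure comes from a field containing a character that cannot be written verbatim while quoting is disabled; with pandas' default quoting the field would simply be quoted, so the exact trigger in the original run may differ:

```python
import csv
import io

import pandas as pd

# An ingredient text containing the separator (",") and quote characters.
df = pd.DataFrame(
    {"ingredients_text": ['water, "sugar", salt'], "lang": ["en"]}
)

# With quoting disabled and no escapechar, the csv writer cannot emit
# the ambiguous characters and raises:
# _csv.Error: need to escape, but no escapechar set
buf = io.StringIO()
try:
    df.to_csv(buf, index=False, quoting=csv.QUOTE_NONE)
except csv.Error as exc:
    print(exc)

# Providing an escapechar lets the writer backslash-escape those
# characters instead of failing.
buf = io.StringIO()
df.to_csv(buf, index=False, quoting=csv.QUOTE_NONE, escapechar="\\")
print(buf.getvalue())
```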
Thank you very much for this PR! I will add a few suggestions for the next steps in the main issue #349
This is the research conducted for issue #349:
01_extract_data.py: extracts all texts with their languages from the Hugging Face dataset.
02_select_short_texts_with_known_ingredients.py: filters texts up to 10 words long, performs ingredient analysis via the OFF API, selects ingredient texts with at least 80% known ingredients, and adds short texts from manually checked data.
What is the manually checked data?
I created a validation dataset from OFF texts (42 languages, 15-30 texts per language).
I took 30 random texts in each language and obtained language predictions using the DeepL API and two other models (language-detection-fine-tuned-on-xlm-roberta-base and multilingual-e5-language-detection). For languages they don't support, I used Google Translate and ChatGPT for verification. (As a result, after correcting the labels, some languages have fewer than 30 texts.)
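The 80%-known-ingredients filter described for 02_select_short_texts_with_known_ingredients.py could be sketched roughly as below. The helper names and the parsed-response shape (a list of ingredient dicts with an "is_in_taxonomy" flag) are assumptions for illustration, not the actual OFF API client or schema:

```python
# Hypothetical sketch of the >= 80% known-ingredients filter.

def known_ingredient_ratio(ingredients: list[dict]) -> float:
    """Fraction of parsed ingredients recognised in the taxonomy."""
    if not ingredients:
        return 0.0
    known = sum(1 for ing in ingredients if ing.get("is_in_taxonomy"))
    return known / len(ingredients)

def keep_text(ingredients: list[dict], threshold: float = 0.8) -> bool:
    """Keep a text only if enough of its ingredients are known."""
    return known_ingredient_ratio(ingredients) >= threshold

# Example: 4 of 5 ingredients recognised -> ratio 0.8, text is kept.
parsed = [{"is_in_taxonomy": 1}] * 4 + [{"is_in_taxonomy": 0}]
print(keep_text(parsed))  # True
```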
03_calculate_metrics.py: obtains predictions from the FastText and lingua language-detector models for texts up to 10 words long, and calculates precision, recall, and F1-score.
Results are in the files: 10_words_metrics.csv, fasttext_confusion_matrix.csv, lingua_confusion_matrix.csv.
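The per-language precision/recall/F1 computation can be sketched with plain Python (the function name is illustrative; the actual script may compute these differently, e.g. via scikit-learn):

```python
from collections import Counter

def per_language_metrics(y_true: list[str], y_pred: list[str]) -> dict:
    """Per-label precision, recall and F1 from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this language, but it was wrong
            fn[truth] += 1  # missed this language
    metrics = {}
    for lang in set(y_true) | set(y_pred):
        p = tp[lang] / (tp[lang] + fp[lang]) if tp[lang] + fp[lang] else 0.0
        r = tp[lang] / (tp[lang] + fn[lang]) if tp[lang] + fn[lang] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        metrics[lang] = {"precision": p, "recall": r, "f1": f1}
    return metrics

# Toy example: "fr" is over-predicted, giving it low precision
# but high recall -- the pattern described in the results below.
truth = ["en", "en", "fr", "de"]
preds = ["en", "fr", "fr", "de"]
print(per_language_metrics(truth, preds)["fr"])
```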
It turned out that both models demonstrate low precision and high recall for some languages (indicating that the threshold might be too high and should be adjusted).