Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change message on missing language data for video #227

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

elfkuzco
Copy link
Contributor

@elfkuzco elfkuzco commented Sep 23, 2024

Rationale

Sometimes, even though a video exists for a particular native language, there is no available subtitle/language data for the video in that language. This PR re-words the warning message to a more descriptive message. This resolves #216.

Changes

  • Fetch video url using it's canonical url and build up warning message. The get attribute is used as we are in an exception handler and using the [] to access the canonical url might throw an error given this data is from an external API and we don't want to throw an exception from here.

Copy link

codecov bot commented Sep 23, 2024

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 5.00%. Comparing base (f976226) to head (b802aaf).

Files with missing lines Patch % Lines
src/ted2zim/scraper.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##            main    #227   +/-   ##
=====================================
  Coverage   5.00%   5.00%           
=====================================
  Files          8       8           
  Lines       1098    1098           
  Branches     238     238           
=====================================
  Hits          55      55           
  Misses      1042    1042           
  Partials       1       1           
Flag Coverage Δ
5.00% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benoit74 benoit74 self-requested a review September 24, 2024 08:33
Copy link
Collaborator

@benoit74 benoit74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking a bit more in-depth to this code, I believe that we were wrong on the issue analysis. Sorry for that.

I now don't get why the scraper keeps trying to find the language name for every video, while in fact these names should be fixed for the whole scrape.

I.e. once we know that en is English, we need to keep this information somewhere.

Also why are we searching for lang name for subtitles if they do not exists? I realize that I don't get your reasoning in fact.

@elfkuzco
Copy link
Contributor Author

Okay, um, I tried to drill down the code to understand the error and here's what I found (probably not 100% spot-on).

When the scraper is called without the language or subtitle setting, each video in a particular topic has its own languages and thus can't be set for the whole scrape. So, when it fetches the first video in a topic, it uses its data to

  • obtain all the available languages for that video
  • tries to see if the video has been translated to that language or there is a matching subtitle because the default setting for subtitles is matching.

The definition of a video being "translated" means metadata for the video, like title, etc has been written in that language and not necessarily that the audio is in that language. The logic for update_videos_list handles this correctly by simply adding the metadata for the video. Visiting a video like https://ted.com/talks/matt_mills_image_recognition_that_triggers_augmented_reality?language=fr-ca also shows that it is accurate as the video exists in the language. You might want to add &subtitle=fr to the query as their website keeps auto selecting subtitle.

Then, there are also edge cases where a video exists in a language like English but it doesn't have any languages data which means that there aren't even any subtitles for it. For example: https://www.ted.com/talks/dan_walker_can_we_print_smart_objects. I think it's cases like this that the error throws up.

@elfkuzco elfkuzco requested a review from benoit74 September 25, 2024 09:20
@benoit74
Copy link
Collaborator

I''m sorry, I won't have time to delve into it seriously for the moment. Thank you for this analysis, and I will have to delve to be sure to implement the right thing. Keeping this PR open until then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lots of warnings WARNING:player data has no entry for en: list index out of range
2 participants