-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Local paths in common voice #3736
Conversation
"""Yields examples.""" | ||
if archive_iterator is not None: | ||
yield from self._generate_examples_streaming(archive_iterator, filepath, path_to_clips) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit - I'd even pass the streaming
flag here to make it super clear that they are two different modes and maybe have both a _generate_examples_streaming(...)
and a _generate_examples_non_streaming(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks great to me! Think it's quite easy to understand that there are two parallel ways now on how to download and prepare audio datasets
- pass streaming to _generate_examples - separate in two methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the importance of this PR is fixing Common Voice, once we realize that the approach of non-extracting the archive content in non-streaming mode is not optimal/suitable for users who want to have direct access to audio file paths.
So OK for approving it and merge/fix Common Voice the faster the better.
But on the other hand, IMHO, I think this specific solution adds complexity to handling streaming/non-streaming, and moves this complexity to the loading script and thus to the contributors/users who want to create the loading script for their canonical/community datasets (instead of keeping it hidden form the end users).
Maybe we could have a discussion in the near future to see if it is possible to find other solutions (or not), while keeping the requirement of having access to the audio file paths in non-streaming. It is just a suggestion! :)
src/datasets/builder.py
Outdated
split_generators_kwargs = {} | ||
split_generators_arg_names = inspect.signature(self._split_generators).parameters.keys() | ||
if "streaming" in split_generators_arg_names: | ||
streaming = isinstance(prepare_split_kwargs.get("dl_manager"), StreamingDownloadManager) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you need the DownloadManager
instance to find out whether we are in streaming mode or not... Because the builder itself knows nothing about streaming or not...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed having this logic inside the builder
can be a bit unexpected. An alternative would be to replace the streaming
parameter by
streaming = dl_manager.is_streaming
inside the dataset script
I just changed to I think it's better this way, but let me know if you preferred the previous way and I can revert
I'm down to discuss this more in the future ! |
@lhoestq good idea: much cleaner this way! That way each class has its own responsibilities without mixing around... |
Continuation of #3664:
streaming
parameter to _split_generatorcommon_voice
download_and_extract
in non-streaming anditer_archive
in streamingNow the
common_voice
dataset has a local path back inds["path"]
, and this field isNone
in streaming mode.cc @patrickvonplaten @anton-l @albertvillanova
Fix #3663.