Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Local paths in common voice #3736

Merged
merged 7 commits into from
Feb 22, 2022
Merged

Local paths in common voice #3736

merged 7 commits into from
Feb 22, 2022

Conversation

lhoestq
Copy link
Member

@lhoestq lhoestq commented Feb 16, 2022

Continuation of #3664:

  • pass the streaming parameter to _split_generator
  • update @anton-l's code to use this parameter for common_voice
  • add a comment to explain why we use download_and_extract in non-streaming and iter_archive in streaming

Now the common_voice dataset has a local path back in ds["path"], and this field is None in streaming mode.

cc @patrickvonplaten @anton-l @albertvillanova

Fix #3663.

"""Yields examples."""
if archive_iterator is not None:
yield from self._generate_examples_streaming(archive_iterator, filepath, path_to_clips)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit - I'd even pass the streaming flag here to make it super clear that they are two different modes and maybe have both a _generate_examples_streaming(...) and a _generate_examples_non_streaming(...)

Copy link
Contributor

@patrickvonplaten patrickvonplaten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks great to me! Think it's quite easy to understand that there are two parallel ways now on how to download and prepare audio datasets

- pass streaming to _generate_examples
- separate in two methods
Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the importance of this PR is fixing Common Voice, once we realize that the approach of non-extracting the archive content in non-streaming mode is not optimal/suitable for users who want to have direct access to audio file paths.

So OK for approving it and merge/fix Common Voice the faster the better.

But on the other hand, IMHO, I think this specific solution adds complexity to handling streaming/non-streaming, and moves this complexity to the loading script and thus to the contributors/users who want to create the loading script for their canonical/community datasets (instead of keeping it hidden form the end users).

Maybe we could have a discussion in the near future to see if it is possible to find other solutions (or not), while keeping the requirement of having access to the audio file paths in non-streaming. It is just a suggestion! :)

split_generators_kwargs = {}
split_generators_arg_names = inspect.signature(self._split_generators).parameters.keys()
if "streaming" in split_generators_arg_names:
streaming = isinstance(prepare_split_kwargs.get("dl_manager"), StreamingDownloadManager)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you need the DownloadManager instance to find out whether we are in streaming mode or not... Because the builder itself knows nothing about streaming or not...

Copy link
Member Author

@lhoestq lhoestq Feb 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed having this logic inside the builder can be a bit unexpected. An alternative would be to replace the streaming parameter by

streaming = dl_manager.is_streaming

inside the dataset script

@lhoestq
Copy link
Member Author

lhoestq commented Feb 21, 2022

I just changed to dl_manager.is_streaming rather than an additional parameter streaming that has to be handled by the DatasetBuilder class - this way the streaming logic doesn't interfere with the base builder's code.

I think it's better this way, but let me know if you preferred the previous way and I can revert

But on the other hand, IMHO, I think this specific solution adds complexity to handling streaming/non-streaming, and moves this complexity to the loading script and thus to the contributors/users who want to create the loading script for their canonical/community datasets (instead of keeping it hidden form the end users).

I'm down to discuss this more in the future !

@albertvillanova
Copy link
Member

@lhoestq good idea: much cleaner this way! That way each class has its own responsibilities without mixing around...

@lhoestq lhoestq merged commit e3c8e25 into master Feb 22, 2022
@lhoestq lhoestq deleted the local-paths-in-common_voice branch February 22, 2022 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants