Local paths in common voice #3736

lhoestq · 2022-02-16T15:01:29Z

Continuation of #3664:

- pass the streaming parameter to _split_generators
- update @anton-l's code to use this parameter for common_voice
- add a comment to explain why we use download_and_extract in non-streaming and iter_archive in streaming

Now the common_voice dataset has a local path back in ds["path"], and this field is None in streaming mode.
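
To illustrate the two modes described above, here is a minimal standard-library sketch. It does not use the `datasets` API: `extract_then_list` and `iter_archive_members` are hypothetical stand-ins for what `download_and_extract` and `iter_archive` achieve, and the archive is a tiny TAR file built on the fly with made-up contents.

```python
import io
import os
import tarfile
import tempfile


def make_demo_archive(directory):
    """Build a tiny TAR archive with two fake clips (illustrative data only)."""
    archive_path = os.path.join(directory, "clips.tar")
    with tarfile.open(archive_path, "w") as tar:
        for name in ("clips/a.mp3", "clips/b.mp3"):
            payload = name.encode()
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return archive_path


def extract_then_list(archive_path, target_dir):
    """Non-streaming stand-in: extract everything, then yield local file paths."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        tar.extractall(target_dir)
    for root, _, files in sorted(os.walk(target_dir)):
        for file_name in sorted(files):
            local_path = os.path.join(root, file_name)
            with open(local_path, "rb") as f:
                yield {"path": local_path, "bytes": f.read()}


def iter_archive_members(archive_path):
    """Streaming stand-in: read members sequentially, so there is no local path."""
    with tarfile.open(archive_path) as tar:
        for member in tar:
            if member.isfile():
                yield {"path": None, "bytes": tar.extractfile(member).read()}


with tempfile.TemporaryDirectory() as tmp:
    archive = make_demo_archive(tmp)
    local_examples = list(extract_then_list(archive, os.path.join(tmp, "extracted")))
    streamed_examples = list(iter_archive_members(archive))
    # Checked inside the block, because the temporary directory is deleted on exit.
    local_paths_exist = all(os.path.exists(e["path"]) for e in local_examples)
```

In the extracted case every example carries a usable file path, while the sequential read yields the same bytes with `path` set to `None`, which mirrors the behaviour described for `ds["path"]`.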

cc @patrickvonplaten @anton-l @albertvillanova

patrickvonplaten · 2022-02-16T16:40:27Z

datasets/common_voice/common_voice.py

        """Yields examples."""
+        if archive_iterator is not None:
+            yield from self._generate_examples_streaming(archive_iterator, filepath, path_to_clips)


small nit - I'd even pass the streaming flag here to make it super clear that they are two different modes and maybe have both a _generate_examples_streaming(...) and a _generate_examples_non_streaming(...)
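
A rough sketch of what this suggestion could look like is shown below. `DemoBuilder`, its argument names, and the shape of the yielded examples are all hypothetical and only illustrate the explicit dispatch; this is not the actual Common Voice loading script.

```python
class DemoBuilder:
    """Hypothetical builder used only to illustrate the two-method split."""

    def _generate_examples(self, streaming, clips):
        # Dispatch explicitly on the mode so the two code paths stay separate.
        if streaming:
            yield from self._generate_examples_streaming(clips)
        else:
            yield from self._generate_examples_non_streaming(clips)

    def _generate_examples_streaming(self, clips):
        # Assumption: `clips` is an iterable of (name, bytes) pairs read from an archive.
        for key, (name, data) in enumerate(clips):
            yield key, {"path": None, "bytes": data}

    def _generate_examples_non_streaming(self, clips):
        # Assumption: `clips` is an iterable of local file paths of extracted files.
        for key, local_path in enumerate(clips):
            yield key, {"path": local_path}


builder = DemoBuilder()
streamed = list(builder._generate_examples(True, [("a.mp3", b"x")]))
local = list(builder._generate_examples(False, ["/tmp/clips/a.mp3"]))
```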

patrickvonplaten

That looks great to me! I think it's quite easy to understand that there are now two parallel ways to download and prepare audio datasets

- pass streaming to _generate_examples - separate in two methods

albertvillanova

I think the importance of this PR is fixing Common Voice, now that we have realized that not extracting the archive content in non-streaming mode is not suitable for users who want direct access to audio file paths.

So I'm OK with approving it: the sooner we merge it and fix Common Voice, the better.

But on the other hand, IMHO, I think this specific solution adds complexity to handling streaming/non-streaming, and moves this complexity to the loading script and thus to the contributors/users who want to create the loading script for their canonical/community datasets (instead of keeping it hidden from the end users).

Maybe we could have a discussion in the near future to see if it is possible to find other solutions (or not), while keeping the requirement of having access to the audio file paths in non-streaming. It is just a suggestion! :)

albertvillanova · 2022-02-21T15:18:16Z

src/datasets/builder.py

+        split_generators_kwargs = {}
+        split_generators_arg_names = inspect.signature(self._split_generators).parameters.keys()
+        if "streaming" in split_generators_arg_names:
+            streaming = isinstance(prepare_split_kwargs.get("dl_manager"), StreamingDownloadManager)


I guess you need the DownloadManager instance to find out whether we are in streaming mode or not... Because the builder itself knows nothing about streaming or not...
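
The idea in the diff above, passing `streaming` only to methods that declare it, can be sketched in isolation as follows. The two manager classes here are empty stand-ins and `call_split_generators` is a hypothetical helper, not the builder's real code.

```python
import inspect


class DownloadManager:
    """Stand-in for the regular (non-streaming) download manager."""


class StreamingDownloadManager:
    """Stand-in; only its type is used to detect streaming mode."""


def call_split_generators(split_generators, dl_manager):
    """Pass `streaming` only to callables whose signature declares it."""
    kwargs = {}
    if "streaming" in inspect.signature(split_generators).parameters:
        kwargs["streaming"] = isinstance(dl_manager, StreamingDownloadManager)
    return split_generators(dl_manager, **kwargs)


def old_style(dl_manager):
    # An existing script that knows nothing about streaming keeps working.
    return "no streaming argument"


def new_style(dl_manager, streaming=False):
    # A script that opts in simply adds the parameter to its signature.
    return f"streaming={streaming}"


old_result = call_split_generators(old_style, DownloadManager())
new_streaming = call_split_generators(new_style, StreamingDownloadManager())
new_regular = call_split_generators(new_style, DownloadManager())
```

The signature check keeps older scripts compatible, at the cost of the builder having to know about streaming.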

Indeed having this logic inside the builder can be a bit unexpected. An alternative would be to replace the streaming parameter by

streaming = dl_manager.is_streaming

inside the dataset script
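
A minimal sketch of this alternative, assuming each download manager exposes a boolean `is_streaming` attribute as proposed here. The classes are simplified stand-ins and the returned dictionary is purely illustrative.

```python
class DownloadManager:
    """Stand-in for the regular download manager."""

    is_streaming = False


class StreamingDownloadManager:
    """Stand-in for the streaming download manager."""

    is_streaming = True


def split_generators(dl_manager):
    """Hypothetical dataset-script method that asks the manager for the mode."""
    streaming = dl_manager.is_streaming
    if streaming:
        # In streaming mode the archive would be iterated without extraction.
        return {"strategy": "iter_archive", "has_local_paths": False}
    # Otherwise the archive would be downloaded and extracted to local files.
    return {"strategy": "download_and_extract", "has_local_paths": True}


regular_plan = split_generators(DownloadManager())
streaming_plan = split_generators(StreamingDownloadManager())
```

With this shape the base builder needs no signature inspection, since the mode travels with the object the script already receives.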

lhoestq · 2022-02-21T16:25:11Z

I just changed to dl_manager.is_streaming rather than an additional parameter streaming that has to be handled by the DatasetBuilder class - this way the streaming logic doesn't interfere with the base builder's code.

I think it's better this way, but let me know if you preferred the previous way and I can revert

> But on the other hand, IMHO, I think this specific solution adds complexity to handling streaming/non-streaming, and moves this complexity to the loading script and thus to the contributors/users who want to create the loading script for their canonical/community datasets (instead of keeping it hidden from the end users).

I'm down to discuss this more in the future!

albertvillanova · 2022-02-22T07:13:06Z

@lhoestq good idea: much cleaner this way! That way each class has its own responsibilities without mixing around...

anton-l and others added 3 commits February 16, 2022 14:28

Merge generators for local files and streaming

5cabd27

add the streaming parameter to _split_generators

193130e

update common_voice

bb8c730

patrickvonplaten reviewed Feb 16, 2022

patrick's comment:

e3a59c3

- pass streaming to _generate_examples - separate in two methods

lhoestq requested a review from albertvillanova February 16, 2022 18:27

albertvillanova approved these changes Feb 21, 2022

lhoestq added 3 commits February 21, 2022 17:20

add is_streaming attribute to the dl managers

fa58a9c

revert the streaming parameter being passed to _split_generators

c325024

Merge branch 'master' into local-paths-in-common_voice

5288e6f

lhoestq merged commit e3c8e25 into master Feb 22, 2022

lhoestq deleted the local-paths-in-common_voice branch February 22, 2022 09:13

This was referenced Feb 22, 2022

[WIP] Return local paths to Common Voice #3664

Closed

[Audio] Path of Common Voice cannot be used for audio loading anymore #3663

Closed

lhoestq mentioned this pull request Mar 3, 2022

Simplify Common Voice code #3817

Merged
