
Don't duplicate data when encoding audio or image #4187

Merged · 3 commits · Apr 21, 2022

Conversation

@lhoestq (Member) commented Apr 20, 2022

Right now, if you pass both the bytes and a local path for audio or image data, the bytes are unnecessarily written to the Arrow file, even though keeping just the local path would suffice.

This PR discards the bytes when the audio or image file exists locally.

In particular it's common for audio datasets builders to provide both the bytes and the local path in order to work for both streaming (using the bytes) and non-streaming mode (using a local file - which is often required for audio).
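The rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual `datasets` implementation; the `encode_example` name and the assumption that an audio/image cell is a dict with `"path"` and `"bytes"` keys are taken from how cells are discussed in this thread.

```python
import os

def encode_example(value: dict) -> dict:
    # Hypothetical sketch of the rule this PR introduces: a cell is
    # assumed to be a dict with "path" and "bytes" keys.
    path, data = value.get("path"), value.get("bytes")
    if data is not None and path is not None and os.path.isfile(path):
        # The file exists locally: keep only the path and discard the
        # redundant bytes so they are not duplicated in the Arrow file.
        return {"path": path, "bytes": None}
    # Otherwise (e.g. streaming, or no file on disk): keep the bytes.
    return {"path": path, "bytes": data}
```

A builder can then keep providing both fields, and the redundant copy is dropped only when the local file is actually present.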

cc @patrickvonplaten

@HuggingFaceDocBuilderDev commented Apr 20, 2022

The documentation is not available anymore as the PR was closed or merged.

@mariosasko (Collaborator) left a comment

Thanks! LGTM!

@albertz commented Apr 20, 2022

I'm not familiar with the concept of streaming vs non-streaming in HF datasets, so I just wonder why you have the distinction here. Why doesn't it work to always use bytes? "using a local file - which is often required for audio" - why would that be?

Would the path always point to some location in the cache_dir? I think this can be problematic: I would have expected that after calling dataset.save_to_disk(...) I can remove the cache dir. But maybe that's just because I'm not familiar with HF. Or maybe the docs can be improved to clarify this.

@patrickvonplaten (Contributor) commented Apr 20, 2022

We could always load every data file and store the audio as bytes in the Arrow format, but the problem is that this makes the file column useless: people can no longer inspect the audio file locally, or else they would first have to save the bytes as a file themselves, which is not obvious. This either breaks backwards compatibility or forces the user to store 2x the required size locally. There was a longer discussion here: #3663

It's a good argument though that dataset.save_to_disk(...) should save everything that is needed to disk and be independent of other folders, but I do think the arguments of #3663 (not breaking backwards compatibility, and letting people inspect the downloaded audio files locally) are a bit more important here.

But maybe we could add a flag, save_files_as_bytes, make_independent, make_self_contained (or a better name), to save_to_disk(...) and push_to_hub(...) that would make the resulting folder completely self-contained.

@patrickvonplaten (Contributor) commented Apr 20, 2022

What do you think @mariosasko @lhoestq @polinaeterna @anton-l ?

@lhoestq (Member, Author) commented Apr 21, 2022

For context: you can either store the path to local images or audio files, or the bytes of those files.

If your images and audio files are local files, then the arrow file from save_to_disk will store paths to these files.
If you want to include the bytes of your images or audio files instead, you must read() those files first.
This can be done by storing the "bytes" instead of the "path" of the images or audio files.
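As a rough illustration of the two storage modes described above (the dict layout with `"path"` and `"bytes"` keys follows this thread; the helper names themselves are made up for the sketch):

```python
def cell_from_path(path: str) -> dict:
    # Path-based storage: the Arrow file keeps only the path string,
    # so the dataset stays small but depends on the local files.
    return {"path": path, "bytes": None}

def cell_from_bytes(path: str) -> dict:
    # Bytes-based storage: read() the file and embed its raw content,
    # making the dataset self-contained (like push_to_hub's Parquet files).
    with open(path, "rb") as f:
        return {"path": None, "bytes": f.read()}
```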

On the other hand, the resulting Parquet files from push_to_hub are self-contained, so that anyone can reload the dataset from the Hub. If your dataset contains image or audio data, the Parquet files will store the bytes of your images or audio files.

For now I just updated the documentation: #4193. Maybe we can also embed the image and audio bytes in save_to_disk when we implement sharding, so that it can be done as efficiently as push_to_hub.

Anyway, merging this one :)

@lhoestq lhoestq merged commit b564af7 into master Apr 21, 2022
@lhoestq lhoestq deleted the dont-duplicate-data-when-encoding-audio-or-image branch April 21, 2022 09:10