
Don't duplicate data when encoding audio or image #4187

Merged · 3 commits · Apr 21, 2022

Conversation

@lhoestq (Member) commented Apr 20, 2022

Right now, if you pass both the bytes and a local path for audio or image data, the bytes are unnecessarily written to the Arrow file, even though keeping just the local path would suffice.

This PR discards the bytes when the audio or image file exists locally.

In particular it's common for audio datasets builders to provide both the bytes and the local path in order to work for both streaming (using the bytes) and non-streaming mode (using a local file - which is often required for audio).
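The rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual `datasets` implementation; the `encode_example` name and the assumption that an audio/image cell is a dict with `"path"` and `"bytes"` keys are taken from how cells are discussed in this thread.

```python
import os

def encode_example(value: dict) -> dict:
    # Hypothetical sketch of the rule this PR introduces: a cell is
    # assumed to be a dict with "path" and "bytes" keys.
    path, data = value.get("path"), value.get("bytes")
    if data is not None and path is not None and os.path.isfile(path):
        # The file exists locally: keep only the path and discard the
        # redundant bytes so they are not duplicated in the Arrow file.
        return {"path": path, "bytes": None}
    # Otherwise (e.g. streaming, or no file on disk): keep the bytes.
    return {"path": path, "bytes": data}
```

A builder can then keep providing both fields, and the redundant copy is dropped only when the local file is actually present.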

cc @patrickvonplaten

@HuggingFaceDocBuilderDev commented Apr 20, 2022

The documentation is not available anymore as the PR was closed or merged.

@mariosasko (Collaborator) left a comment

Thanks! LGTM!

@albertz commented Apr 20, 2022

I'm not familiar with the concept of streaming vs non-streaming in HF datasets, so I just wonder why you have the distinction here. Why doesn't it work to always use bytes? "using a local file - which is often required for audio" - why would that be?

Would the path always point to some location in the cache_dir? I think this can be problematic: I would have expected that after calling dataset.save_to_disk(...) I can remove the cache dir. But maybe that's just because I'm not familiar with HF. Or maybe the docs can be improved to clarify this.

@patrickvonplaten (Contributor) commented Apr 20, 2022

We could always load every data file and store the audio as bytes in the Arrow format, but the problem is that this makes the file column useless: people can no longer inspect the audio file locally, or else they would first have to save the bytes as a file themselves, which is not obvious. This either breaks backwards compatibility or forces the user to store 2x the required size locally. There was a longer discussion here: #3663

It's a good argument though that dataset.save_to_disk(...) should save everything that is needed to disk and be independent of other folders, but I do think the arguments of #3663 (not breaking backwards compatibility, and letting people inspect the downloaded audio files locally) are a bit more important here.

But maybe we could add a flag, save_files_as_bytes, make_independent, make_self_contained (or a better name), to save_to_disk(...) and push_to_hub(...) that would make the resulting folder completely self-contained.

@patrickvonplaten (Contributor) commented Apr 20, 2022

What do you think @mariosasko @lhoestq @polinaeterna @anton-l ?

@lhoestq (Member, Author) commented Apr 21, 2022

For context: you can either store the path to local images or audio files, or the bytes of those files.

If your images and audio files are local files, then the arrow file from save_to_disk will store paths to these files.
If you want to include the bytes of your images or audio files instead, you must read() those files first.
This can be done by storing the "bytes" instead of the "path" of the images or audio files.
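As a rough illustration of the two storage modes described above (the dict layout with `"path"` and `"bytes"` keys follows this thread; the helper names themselves are made up for the sketch):

```python
def cell_from_path(path: str) -> dict:
    # Path-based storage: the Arrow file keeps only the path string,
    # so the dataset stays small but depends on the local files.
    return {"path": path, "bytes": None}

def cell_from_bytes(path: str) -> dict:
    # Bytes-based storage: read() the file and embed its raw content,
    # making the dataset self-contained (like push_to_hub's Parquet files).
    with open(path, "rb") as f:
        return {"path": None, "bytes": f.read()}
```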

On the other hand, the resulting Parquet files from push_to_hub are self-contained, so that anyone can reload the dataset from the Hub. If your dataset contains image or audio data, the Parquet files will store the bytes of your images or audio files.

For now I just updated the documentation: #4193. Maybe we can also embed the image and audio bytes in save_to_disk when we implement sharding, so that it can be done as efficiently as push_to_hub.

Anyway, merging this one :)

@lhoestq lhoestq merged commit b564af7 into master Apr 21, 2022
@lhoestq lhoestq deleted the dont-duplicate-data-when-encoding-audio-or-image branch April 21, 2022 09:10