Don't duplicate data when encoding audio or image #4187
The documentation is not available anymore as the PR was closed or merged.
Thanks! LGTM!
I'm not familiar with the concept of streaming vs non-streaming in HF datasets. I just wonder why you have the distinction here. Why doesn't it work to always make use of The
We could always load every data file into It's a good argument though that But maybe, we could add a flag,
What do you think @mariosasko @lhoestq @polinaeterna @anton-l?
For context: you can either store the path to local images or audio files, or the bytes of those files. If your images and audio files are local files, then the arrow file from On the other hand, the resulting Parquet files from For now I just updated the documentation: #4193. Maybe we can also embed the image and audio bytes in Anyway, merging this one :)
Right now if you pass both the `bytes` and a local `path` for audio or image data, then the `bytes` are unnecessarily written in the Arrow file, while we could just keep the local `path`.

This PR discards the `bytes` when the audio or image file exists locally.

In particular it's common for audio dataset builders to provide both the bytes and the local path in order to work for both streaming (using the bytes) and non-streaming mode (using a local file, which is often required for audio).
cc @patrickvonplaten
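
To illustrate the behaviour described above, here is a minimal sketch of the decision rule: keep only the `path` when the file exists locally, otherwise keep the `bytes`. The helper name `encode_file_example` and its signature are hypothetical and chosen for illustration only; this is not the actual implementation in the library, just one way the rule could look.

```python
import os
from typing import Dict, Optional, Union


def encode_file_example(
    path: Optional[str], data: Optional[bytes]
) -> Dict[str, Union[str, bytes, None]]:
    """Hypothetical helper (not the library's API): choose what to store
    for one audio or image example."""
    if path is not None and os.path.isfile(path):
        # The file exists locally: keep only the path and discard the bytes,
        # so the same data is not duplicated in the Arrow file.
        return {"path": path, "bytes": None}
    if data is not None:
        # No local file (e.g. streaming mode): the bytes are the only
        # usable copy of the data, so keep them.
        return {"path": path, "bytes": data}
    if path is not None:
        # Only a path was given and it is not a local file; store it as-is.
        return {"path": path, "bytes": None}
    raise ValueError("Expected at least one of `path` or `bytes` to be set.")
```

With this rule, a builder that yields both the bytes and the local path stores only the path in non-streaming mode, and still stores the bytes in streaming mode where no local file exists.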