Automatically split input dataset in ray mode #415
base: main
This PR is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this PR will be closed in 3 days.
Close this stale PR.
Cc: @pan-x-c, @chenyushuo When available, please add the new rule that takes Ray's auto-split feature into account in this PR and resolve the conflicts for CR. Additionally, we need to incorporate the streaming_load_json patch into the main branch to align with our 2.0 paper.
Description
Split the dataset files into small pieces and process them in different batches to avoid exceeding the memory limit of Ray.
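
For readers unfamiliar with the idea, here is a minimal sketch of the split-then-batch approach described above. It is not the code from this PR: the function names (`split_jsonl`, `iter_batches`, `process_batch`), the JSONL input format, and the line-count threshold are all assumptions made for illustration, and the sketch uses only the standard library instead of Ray.

```python
import json
import os
import tempfile
from typing import Iterator, List


def split_jsonl(src_path: str, out_dir: str, max_lines: int) -> List[str]:
    """Split one JSONL file into pieces of at most `max_lines` lines each.

    Hypothetical helper: the PR may split by bytes or by another rule.
    """
    if max_lines < 1:
        raise ValueError("max_lines must be >= 1")
    os.makedirs(out_dir, exist_ok=True)
    pieces: List[str] = []
    out = None
    with open(src_path, "r", encoding="utf-8") as src:
        for i, line in enumerate(src):
            # Start a new piece every `max_lines` lines.
            if i % max_lines == 0:
                if out is not None:
                    out.close()
                piece_path = os.path.join(out_dir, f"part-{len(pieces):05d}.jsonl")
                pieces.append(piece_path)
                out = open(piece_path, "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return pieces


def iter_batches(paths: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` piece paths."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


def process_batch(paths: List[str]) -> int:
    """Stand-in for the real per-batch work: parse and count the records."""
    count = 0
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    json.loads(line)
                    count += 1
    return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "data.jsonl")
        with open(src, "w", encoding="utf-8") as f:
            for i in range(10):
                f.write(json.dumps({"id": i}) + "\n")
        pieces = split_jsonl(src, os.path.join(tmp, "pieces"), max_lines=3)
        # Only one batch of pieces is held in memory at a time.
        total = sum(process_batch(batch) for batch in iter_batches(pieces, 2))
        print(len(pieces), total)
```

In a Ray-based pipeline, `process_batch` would presumably be replaced by loading each group of paths as a Ray dataset (for example with `ray.data.read_json`) and running the operators on it, so that peak memory is bounded by the batch size and not by the full dataset.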