-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement to_cudf method for reading directly into GPU memory #19
base: master
Are you sure you want to change the base?
Conversation
Is it not the case that the Furthermore, I think that as coded, you would get different behaviour for read/to_dask depending on whether you had called to_cudf first or not. |
Good point. Perhaps the engine argument should be passed in directly when reading, i.e. as
Will need to test this out, do you mean the reuse of the |
It feels like you can either have the
Exactly - which is why I have a slight preference to setting the engine in init |
Yes the former implementation of using only
Maybe we should deprecate the 'engine' kwarg, and pass it |
I would have engine=
and then any of these can work with read() and to_dask(); the engine= parameter already has this meaning for pandas and Dask. |
I'm ok with this. However, the implementation will be harder, and the codebase will need to change significantly to handle this. Currently, the parquet dataset is loaded lazily via dask first: intake-parquet/intake_parquet/source.py Lines 59 to 60 in e3eab2a
and intake-parquet/intake_parquet/source.py Lines 73 to 78 in e3eab2a
In the So yes, we could go with |
Please remind me next week, and I can try: I'm sure we don't need to introduce cudf as a dependency in order to support the engine= keyword. |
Sure, let's check next week. Oh, and we definitely don't need to introduce |
I see. I don't know how dask/cudf are implemented internally, it should be possible to get the base information without dask and then pass to dask-cudf when calling to_dask. This is also a shortcoming of the non-cudf branch, dask is assumed in many places for various drivers. |
Completely forgot about this from so long ago, sorry. #28 does some similar work to make dask optional, so passing the engine through will work now. If there is still interest, that is. |
Adds a
to_cudf
method to allow reading parquet files intocudf.DataFrame
objects. Fixes #17.Let me know if I should add more docstrings or tests, this PR is fairly minimal at this point.