
Add a to_cudf method for reading directly into GPU memory #17

Open · weiji14 opened this issue Aug 21, 2020 · 2 comments · May be fixed by #19

Comments

weiji14 commented Aug 21, 2020

Hi there,

Just wondering if there's scope for a to_cudf type functionality so that users can read Parquet files directly into GPU memory (bypassing the CPU). This would be using the cudf.read_parquet function.
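For reference, this is roughly how that would be used (a minimal sketch; the file path is just a placeholder):

```python
import cudf

# Read a single Parquet file straight into a cudf.DataFrame in GPU memory,
# without a pandas/CPU step in between ("data/file.parquet" is a placeholder path).
gdf = cudf.read_parquet("data/file.parquet")
print(type(gdf))  # cudf.core.dataframe.DataFrame
```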

Happy to submit a Pull Request for this, but would like to have a discussion around the implementation, whether it should be handled as a to_cudf method, or via something like engine="cudf" (though cudf also has a "pyarrow" engine like pandas).

One issue, though, is that cudf cannot read multi-file Parquet datasets yet (see rapidsai/cudf#1688), only single Parquet files. This might get implemented in a future (v0.16?) cudf release though.

@martindurant
Member

I could see it either way, as an argument to to_pandas (and/or to_dask), or as its own method. How many of the sources do you think it would apply to? I know cuDF have performant parquet and CSV readers.

weiji14 commented Aug 22, 2020

> I could see it either way, as an argument to to_pandas (and/or to_dask), or as its own method.

True, since it's possible to have dataframes loaded onto a single GPU (à la to_pandas) or onto multiple GPUs (à la to_dask). So we could have either:

  1. Something like to_pandas(engine="cudf") and to_dask(engine="cudf")
  2. Something like to_cudf() (which uses cudf.read_parquet) or to_dask_cudf() (which uses dask_cudf.read_parquet).

One problem with Option 1 is that cudf.read_parquet also accepts engine="pyarrow", which would clash with the engine keyword of pandas.read_parquet. We could work around that (e.g. .to_pandas(engine="pyarrow", backend="gpu")), but that might get ugly.
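To make Option 2 concrete, here's a rough sketch of what a to_cudf method might look like (illustration only; the class name, constructor, and attribute names such as self._urlpath are assumptions, not the actual intake-parquet internals):

```python
import cudf


class ParquetSourceSketch:
    """Illustrative stand-in for the real Parquet source class."""

    def __init__(self, urlpath, **parquet_kwargs):
        # Hypothetical attributes: the real source stores its path and kwargs differently.
        self._urlpath = urlpath
        self._kwargs = parquet_kwargs

    def to_cudf(self):
        """Read the Parquet data straight into a single-GPU cudf.DataFrame."""
        # cudf.read_parquet currently handles single Parquet files only
        # (see rapidsai/cudf#1688), so multi-file datasets would need dask_cudf.
        return cudf.read_parquet(self._urlpath, **self._kwargs)
```

A to_dask_cudf method would look much the same, swapping in dask_cudf.read_parquet to spread the partitions across multiple GPUs.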

> How many of the sources do you think it would apply to? I know cuDF have performant parquet and CSV readers.

Looking at cudf's IO readers at https://docs.rapids.ai/api/cudf/stable/api.html#module-cudf.io.csv, several file formats are currently available (the full list is in the linked docs).

Perhaps we should discuss this upstream at https://github.com/intake/intake too 😁
