
Add a to_cudf method for reading directly into GPU memory #17

Open · weiji14 opened this issue Aug 21, 2020 · 2 comments · May be fixed by #19

Comments

weiji14 commented Aug 21, 2020

Hi there,

Just wondering if there's scope for a to_cudf type functionality so that users can read Parquet files directly into GPU memory (bypassing the CPU). This would be using the cudf.read_parquet function.
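For reference, this is roughly how that would be used (a minimal sketch; the file path is just a placeholder):

```python
import cudf

# Read a single Parquet file straight into a cudf.DataFrame in GPU memory,
# without a pandas/CPU step in between ("data/file.parquet" is a placeholder path).
gdf = cudf.read_parquet("data/file.parquet")
print(type(gdf))  # cudf.core.dataframe.DataFrame
```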

Happy to submit a Pull Request for this, but would like to have a discussion around the implementation, whether it should be handled as a to_cudf method, or via something like engine="cudf" (though cudf also has a "pyarrow" engine like pandas).

One issue, though, is that cudf cannot read multi-file Parquet datasets yet (see rapidsai/cudf#1688), only single Parquet files. This might get implemented in a future (v0.16?) cudf release though.

@martindurant
Member

I could see it either way, as an argument to to_pandas (and/or to_dask), or as its own method. How many of the sources do you think it would apply to? I know cuDF have performant parquet and CSV readers.

weiji14 commented Aug 22, 2020

> I could see it either way, as an argument to to_pandas (and/or to_dask), or as its own method.

True, since it's possible to have dataframes loaded onto a single GPU (à la to_pandas) or onto multiple GPUs (à la to_dask). So we could have either:

  1. Something like to_pandas(engine="cudf") and to_dask(engine="cudf")
  2. Something like to_cudf() (which uses cudf.read_parquet) or to_dask_cudf() (which uses dask_cudf.read_parquet).

One problem with Option 1 is that cudf.read_parquet also accepts engine="pyarrow", which would clash with the engine keyword of pandas.read_parquet. We could work around that (e.g. .to_pandas(engine="pyarrow", backend="gpu")), but that might get ugly.
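To make Option 2 concrete, here's a rough sketch of what a to_cudf method might look like (illustration only; the class name, constructor, and attribute names such as self._urlpath are assumptions, not the actual intake-parquet internals):

```python
import cudf


class ParquetSourceSketch:
    """Illustrative stand-in for the real Parquet source class."""

    def __init__(self, urlpath, **parquet_kwargs):
        # Hypothetical attributes: the real source stores its path and kwargs differently.
        self._urlpath = urlpath
        self._kwargs = parquet_kwargs

    def to_cudf(self):
        """Read the Parquet data straight into a single-GPU cudf.DataFrame."""
        # cudf.read_parquet currently handles single Parquet files only
        # (see rapidsai/cudf#1688), so multi-file datasets would need dask_cudf.
        return cudf.read_parquet(self._urlpath, **self._kwargs)
```

A to_dask_cudf method would look much the same, swapping in dask_cudf.read_parquet to spread the partitions across multiple GPUs.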

> How many of the sources do you think it would apply to? I know cuDF have performant parquet and CSV readers.

Looking at cudf's IO readers at https://docs.rapids.ai/api/cudf/stable/api.html#module-cudf.io.csv, several file formats are currently available (the full list is in the linked docs).

Perhaps we should discuss this upstream at https://github.com/intake/intake too 😁
