Implement to_cudf method for reading directly into GPU memory #19

weiji14 · 2020-10-11T03:50:51Z

Adds a to_cudf method to allow reading parquet files into cudf.DataFrame objects. Fixes #17.

Let me know if I should add more docstrings or tests; this PR is fairly minimal at this point.

martindurant · 2020-10-13T13:24:32Z

Is it not the case that the engine="cudf" keyword is actually ignored here, and it's only calling to_cudf that's important?
Passing that engine and then naively calling read() I suppose would error?

Furthermore, I think that as coded, you would get different behaviour for read/to_dask depending on whether you had called to_cudf first or not.

weiji14 · 2020-10-13T19:25:15Z

> Is it not the case that the engine="cudf" keyword is actually ignored here, and it's only calling to_cudf that's important?
> Passing that engine and then naively calling read() I suppose would error?

Good point. Perhaps the engine argument should be passed in directly when reading, i.e. as to_cudf(engine='cudf') or to_cudf(engine='pyarrow'), instead of as a more global kwarg. The same would apply to read and to_dask, but this would be a breaking change, albeit one that might be necessary to offer more fine-grained control over reading parquet files.

> Furthermore, I think that as coded, you would get different behaviour for read/to_dask depending on whether you had called to_cudf first or not.

I will need to test this out. Do you mean the reuse of the self._df variable?

martindurant · 2020-10-13T19:41:29Z

It feels like you can either have the engine kwarg or the to_cudf method, not both. Either of these could solve the requirement, and I think the former may be the simpler.

> do you mean the reuse of the self._df variable

Exactly, which is why I have a slight preference for setting the engine in init.
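The concern about the shared self._df cache can be illustrated with a minimal sketch. ToySource below is a hypothetical stand-in, not the actual intake-parquet class, and it assumes only that read and to_cudf both fill and reuse the same cached attribute, as the exchange above suggests:

```python
class ToySource:
    """Hypothetical stand-in for the parquet source (not the real class)."""

    def __init__(self):
        self._df = None  # single cache shared by every reader method

    def _load(self, kind):
        # Whichever method is called first decides what the cache holds.
        if self._df is None:
            self._df = {"backend": kind}
        return self._df

    def read(self):
        return self._load("pandas")

    def to_cudf(self):
        return self._load("cudf")


fresh = ToySource()
print(fresh.read()["backend"])   # pandas

primed = ToySource()
primed.to_cudf()                 # fills the cache with the cudf result
print(primed.read()["backend"])  # cudf, since read() reuses the cache
```

Fixing the engine once at construction time avoids this, because every method then fills the cache with the same kind of object.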

weiji14 · 2020-10-14T03:15:00Z

> It feels like you can either have the engine kwargs or the to_cudf method, not both. Either of these could solve the requirement, and I think the former may be the simpler.

Yes, the former implementation, using only engine, is simpler. However, engine="pyarrow" is available for both the cudf.read_parquet and pd.read_parquet readers, so when engine="pyarrow" is chosen, do we read using pandas or cudf? These are the engines supported by the two readers:

| engine | `cudf.read_parquet` | `pd.read_parquet` |
|---|---|---|
| cudf | ✔️ | |
| pyarrow | ✔️ | ✔️ |
| fastparquet | | ✔️ |

Maybe we should deprecate the 'engine' kwarg, and pass engine in at the to_cudf or to_dask readers? But again, this would be a backward-incompatible change, so we might need to think this over a bit more.

martindurant · 2020-10-14T12:59:16Z

I would have engine=

fastparquet (implies pandas),
(py)arrow (implies pandas),
cudf (implies pyarrow)

and then any of these can work with read() and to_dask(); the engine= parameter already has this meaning for pandas and Dask.
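A rough sketch of how such an engine= dispatch might look is given below. The names resolve_engine and read_parquet are hypothetical, the "arrow" alias is an assumption read from "(py)arrow", and treating engine="cudf" as "call cudf.read_parquet" is one interpretation of the list above rather than anything confirmed in the PR. The libraries are imported only when a read is actually requested, so neither pandas nor cudf is needed just to define the source:

```python
import importlib

# engine name -> (module that provides read_parquet, engine value passed on)
# The "arrow" alias is an assumption based on "(py)arrow" in the comment.
_READERS = {
    "fastparquet": ("pandas", "fastparquet"),
    "pyarrow": ("pandas", "pyarrow"),
    "arrow": ("pandas", "pyarrow"),
    "cudf": ("cudf", "cudf"),
}


def resolve_engine(engine):
    """Return (module_name, backend_engine) for a user-facing engine name."""
    try:
        return _READERS[engine]
    except KeyError:
        raise ValueError(f"unsupported engine: {engine!r}") from None


def read_parquet(path, engine="pyarrow", **kwargs):
    """Read a parquet file with the dataframe library implied by engine."""
    module_name, backend_engine = resolve_engine(engine)
    module = importlib.import_module(module_name)  # imported only on demand
    return module.read_parquet(path, engine=backend_engine, **kwargs)
```

With this layout, engine="pyarrow" always means a pandas read, which sidesteps the ambiguity raised in the table above.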

weiji14 · 2020-10-23T21:29:20Z

> and then any of these can work with read() and to_dask(); the engine= parameter already has this meaning for pandas and Dask.

I'm ok with this. However, the implementation will be harder, and the codebase will need to change significantly to handle this. Currently, the parquet dataset is loaded lazily via dask first:

intake-parquet/intake_parquet/source.py, lines 59 to 60 in e3eab2a:

    if self._df is None:
        self._df = self._to_dask()

and .compute() is called when using read to return a pandas.DataFrame:

intake-parquet/intake_parquet/source.py, lines 73 to 78 in e3eab2a:

    def read(self):
        """
        Create single pandas dataframe from the whole data-set
        """
        self._load_metadata()
        return self._df.compute()

In the cudf world however, this would imply that dask_cudf will always be needed, but dask_cudf isn't a dependency anyone would need unless they have more than 1 GPU.

So yes, we could go with engine='cudf', but this would involve a significant refactoring effort on the backend.

martindurant · 2020-10-23T22:00:18Z

Please remind me next week, and I can try: I'm sure we don't need to introduce cudf as a dependency in order to support the engine= keyword.

weiji14 · 2020-10-23T22:06:08Z

Sure, let's check next week.

Oh, and we definitely don't need to introduce cudf as a dependency in intake-parquet. What I want to avoid is people needing to install both dask_cudf and cudf in order to use this new functionality.

martindurant · 2020-10-26T20:21:19Z

I see. I don't know how dask/cudf are implemented internally, but it should be possible to get the base information without dask and then pass it to dask-cudf when calling to_dask. This is also a shortcoming of the non-cudf branch: dask is assumed in many places for various drivers.
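One way to picture this is a source that stores only the path and engine, and touches dask or dask_cudf solely inside to_dask. This is a hedged sketch: LazyParquetSource and _dask_module_name are hypothetical names, and the code is not taken from intake-parquet.

```python
import importlib


class LazyParquetSource:
    """Hypothetical sketch; not the intake-parquet implementation."""

    def __init__(self, urlpath, engine="pyarrow"):
        # Only plain attributes are stored here, so no dask import is needed.
        self.urlpath = urlpath
        self.engine = engine

    def _dask_module_name(self):
        # dask_cudf is needed only for the lazy, partitioned cudf case.
        return "dask_cudf" if self.engine == "cudf" else "dask.dataframe"

    def to_dask(self):
        dd = importlib.import_module(self._dask_module_name())
        if self.engine == "cudf":
            return dd.read_parquet(self.urlpath)
        return dd.read_parquet(self.urlpath, engine=self.engine)
```

Under this arrangement, someone with a single GPU who never calls to_dask would not need dask_cudf installed.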

martindurant · 2023-05-12T15:13:37Z

Completely forgot about this from so long ago, sorry. #28 does some similar work to make dask optional, so passing the engine through will work now. If there is still interest, that is.
Separately, we are considering completely pulling apart the "file type" and "backend reader" logic in Intake generally, which would lead to far more but much simpler reader classes.

Implement to_cudf method for reading directly into GPU memory

de02a09
