BUG: Computing a dask shuffle returns a pd.DataFrame, not gpd.GeoDataFrame #116

Closed
tastatham opened this issue Sep 24, 2021 · 2 comments · Fixed by #142

Comments

@tastatham
Contributor

Using Dask shuffle to lazily partition a Dask GeoDataFrame correctly returns a dask_geopandas.core.GeoDataFrame. However, computing the result returns a pd.DataFrame, not a gpd.GeoDataFrame. Below is a quick example:

import geopandas
import dask_geopandas

gdf = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
ddf = dask_geopandas.from_geopandas(gdf, 4)

ddf["hilbert"] = ddf.hilbert_distance(p=10)

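def spatial_shuffle(ddf):
    # Shuffle the partitions on the Hilbert distance column.
    return ddf.shuffle(on="hilbert", npartitions=20)

output = spatial_shuffle(ddf)
# Lazy result: dask_geopandas.core.GeoDataFrame, as expected.
print(type(output))

# Computed result: pd.DataFrame rather than gpd.GeoDataFrame (the bug).
print(type(output.compute()))

def spatial_shuffle2(ddf):
    ddf = ddf.shuffle(on="hilbert", npartitions=20)
    # Re-declare the geometry column after the shuffle.
    return ddf.set_geometry("geometry")

output2 = spatial_shuffle2(ddf).compute()
print(type(output2))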

I will look further into the Dask shuffle function and follow up. It could well be an indexing/conversion to pd.DataFrame issue #L358.

@jorisvandenbossche
Member

@tastatham Thanks for the clear example. Looking into it, I now remember that I previously ran into the same problem with set_index: #59. The problem lies in the serialization/deserialization step, for which Dask uses partd: dask/partd#52

Calling set_geometry after the shuffle is probably a good workaround for now.

@tastatham
Contributor Author

Ah, I see. Since this is a Dask issue, I will use set_geometry as a workaround for the "spatial" shuffle for now, and we can drop it once the underlying issue has been resolved.
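
For readers following the thread, the mechanism described above (a serialization round trip that drops the subclass, and a re-wrapping step that restores it) can be illustrated with a toy model. This is only a sketch using the standard library: `Frame`, `GeoFrame`, `dump`, `load` and this `set_geometry` function are hypothetical stand-ins, not the actual partd, Dask or GeoPandas code, and the real serialization path may differ in its details.

```python
import json


class Frame:
    """Hypothetical stand-in for pd.DataFrame: a plain table of named columns."""

    def __init__(self, columns):
        self.columns = dict(columns)


class GeoFrame(Frame):
    """Hypothetical stand-in for gpd.GeoDataFrame: a Frame that knows its geometry column."""

    def __init__(self, columns, geometry="geometry"):
        super().__init__(columns)
        self.geometry_name = geometry


def dump(frame):
    # Only the column data is written; the class of `frame` is not recorded.
    return json.dumps(frame.columns)


def load(payload):
    # The reader has no type information, so it always builds the base class.
    return Frame(json.loads(payload))


def set_geometry(frame, name):
    # Analogue of the workaround: re-wrap the plain table as the subclass.
    if name not in frame.columns:
        raise KeyError(name)
    return GeoFrame(frame.columns, geometry=name)


gdf = GeoFrame(
    {
        "hilbert": [3, 1, 2],
        "geometry": ["POINT (0 0)", "POINT (1 1)", "POINT (2 2)"],
    }
)
round_tripped = load(dump(gdf))
restored = set_geometry(round_tripped, "geometry")

print(type(gdf).__name__)            # GeoFrame
print(type(round_tripped).__name__)  # Frame
print(type(restored).__name__)       # GeoFrame
```

In this toy version the data survives the round trip intact and only the type is lost, which is why re-declaring the geometry column is enough to recover the subclass.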
