DNM: Dumb Read Parquet Implementation
This is a dumb, mostly-from-scratch implementation of read_parquet.

It only supports
-  local and s3
-  column selection
-  grouping partitions when we have fewer columns (+ threads!)
-  arrow engine/filesystem
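
The partition-grouping item above can be illustrated with a small sketch. This is not the code from this commit: the helper name `group_files`, the `target_bytes` parameter, and the idea of estimating read size as file size times the fraction of selected columns are all assumptions made for illustration.

```python
def group_files(file_sizes, n_selected, n_total, target_bytes=128 * 2**20):
    """Greedily pack consecutive files into partitions (hypothetical helper).

    file_sizes: list of (path, size_in_bytes) pairs, in dataset order.
    n_selected / n_total: number of columns read vs. columns in the dataset.
    target_bytes: rough amount of data each output partition should read.
    """
    if n_total <= 0 or not 0 < n_selected <= n_total:
        raise ValueError("need 0 < n_selected <= n_total")

    # Assumption: every column takes roughly the same share of each file.
    fraction = n_selected / n_total

    groups, current, current_bytes = [], [], 0.0
    for path, size in file_sizes:
        estimated = size * fraction
        # Start a new partition once adding this file would exceed the target.
        if current and current_bytes + estimated > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0.0
        current.append(path)
        current_bytes += estimated
    if current:
        groups.append(current)
    return groups


files = [("a.parquet", 100), ("b.parquet", 100),
         ("c.parquet", 100), ("d.parquet", 100)]
all_columns = group_files(files, n_selected=4, n_total=4, target_bytes=100)
half_columns = group_files(files, n_selected=2, n_total=4, target_bytes=100)
```

With all columns selected, each file stays in its own partition. With half of the columns selected, files are paired up. The files within one group could then be read concurrently, for example with a thread pool.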

It is very broken in many ways, but ...

-  It's only around 100 lines of code
-  I get 250 MB/s bandwidth on full column reads on an m6i.xlarge
   (only 50 MB/s when reading a subset of columns, though)

See dask/dask#10602
mrocklin committed Oct 30, 2023
1 parent 810995a commit c110f6e
Showing 3 changed files with 148 additions and 471 deletions.
33 changes: 1 addition & 32 deletions dask_expr/_collection.py
@@ -1171,47 +1171,16 @@ def read_csv(path, *args, usecols=None, **kwargs):
def read_parquet(
    path=None,
    columns=None,
    filters=None,
    categories=None,
    index=None,
    storage_options=None,
    dtype_backend=None,
    calculate_divisions=False,
    ignore_metadata_file=False,
    metadata_task_size=None,
    split_row_groups="infer",
    blocksize="default",
    aggregate_files=None,
    parquet_file_extension=(".parq", ".parquet", ".pq"),
    filesystem="fsspec",
    engine=None,
    **kwargs,
):
-    from dask_expr.io.parquet import ReadParquet, _set_parquet_engine
+    from dask_expr.io.parquet import ReadParquet

    if not isinstance(path, str):
        path = stringify_path(path)

    kwargs["dtype_backend"] = dtype_backend

    return new_collection(
        ReadParquet(
            path,
            columns=_convert_to_list(columns),
            filters=filters,
            categories=categories,
            index=index,
            storage_options=storage_options,
            calculate_divisions=calculate_divisions,
            ignore_metadata_file=ignore_metadata_file,
            metadata_task_size=metadata_task_size,
            split_row_groups=split_row_groups,
            blocksize=blocksize,
            aggregate_files=aggregate_files,
            parquet_file_extension=parquet_file_extension,
            filesystem=filesystem,
            engine=_set_parquet_engine(engine),
            kwargs=kwargs,
        )
    )
