
Parquet Reboot #10602

Open
mrocklin opened this issue Oct 26, 2023 · 20 comments
Labels
discussion (Discussing a topic with no specific actions yet), parquet

Comments

@mrocklin
Member

Our parquet performance is bad. I get 20 MB/s in real-world use cases on the cloud where I would expect 500 MB/s. This accounts for ~80% of our runtime in complex dataframe queries in TPC-H. Systems like P2P, the GIL, Pandas copy-on-write, PyArrow strings, etc. are all inconsequential relative to this performance bottleneck.

Unfortunately, improving this situation is difficult because our current Parquet implementation has many layers and many systems all interwoven with each other. What should be a relatively simple system is today somewhat opaque. I would like for us to consider a rewrite.

There are many things to consider here. I'll follow up with some personal thoughts, and I welcome thoughts from others.

github-actions bot added the needs triage label Oct 26, 2023
@mrocklin
Member Author

mrocklin commented Oct 26, 2023

Personal recommendation

If I were to build this I would remove a lot. I'd remove engines and go Arrow-first for both parquet deserialization and data access. I think we could make this a single-Python-file effort with reliably good performance and much easier maintenance.

We would lose some features:

  • fastparquet
  • index / divisions
  • metadata file
  • We'd probably need to keep a filesystem= keyword in case someone has an odd filesystem and needs to use fsspec, but I think that would be the exception
  • Losing engines probably means that RAPIDS folks need to maintain their own read-parquet code paths outside of Dask. I think that's ok; they've done that before with dask-cudf, and they'd probably just do it again.

Some things to gain

@phofl
Collaborator

phofl commented Oct 26, 2023

Having a smaller environment because you don't have to install arrow is the main benefit of fastparquet, as far as I am aware. Pandas will require arrow as a dependency starting with the 3.0 release if there aren't any unexpected developments, which would render the environment-size argument irrelevant.

@mrocklin
Member Author

There are likely other benefits to fastparquet, but mostly I just want one system rather than having to support many. While I like fastparquet, I like pyarrow marginally more.

@fjetter
Member

fjetter commented Oct 27, 2023

+1

I think specializing on pyarrow will be the biggest benefit here; it's what allows us to cut complexity.

Additionally, I would like us to review closely how we want to deal with files and rowgroups and the collection of statistics.

I'm very strongly biased against collecting statistics for the entire dataset automatically (if ever), but I appreciate that there are use cases where we have to deal with parquet metadata. I would like us to revisit the existing logic and invest time in reproducers that cover the features we are offering and allow us to run tests for them.

Re: the single-file effort. I believe this is something that we also have to review a little more closely. Especially with the introduction of FusedIO in dask-expr, there are significant optimizations available when treating multiple files as a batch (pre-fetching IO, for instance). pyarrow already implements these capabilities transparently to a certain degree.
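For context, a rough sketch of the readahead knobs pyarrow exposes for this (parameter names as found in recent pyarrow.dataset releases; the path and columns are placeholders):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("s3://bucket/tpch/lineitem/", format="parquet")

# The scanner can prefetch record batches within a file as well as whole
# fragments (files), which covers much of the batched / pre-fetched IO that
# FusedIO would otherwise have to orchestrate on the Dask side.
scanner = dataset.scanner(
    columns=["l_orderkey", "l_extendedprice"],
    batch_readahead=16,    # batches prefetched within a fragment
    fragment_readahead=4,  # fragments (files) prefetched concurrently
)
table = scanner.to_table()
```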

@mrocklin
Member Author

I'm going to take some time this morning and prototype a very simple implementation. If it works I'll use it for TPC-H. We'll see how far I get. I give it a 50/50 chance that I give up because it's too complex.

mrocklin added a commit to mrocklin/dask-expr that referenced this issue Oct 27, 2023
This is a dumb, mostly-from-scratch implementation of read_parquet.

It only supports
-  local and s3
-  column selection
-  grouping partitions when we have fewer columns (+ threads!)
-  arrow engine/filesystem

It is very broken in many ways, but ...

-  It's only around 100 lines of code
-  I get 250 MB/s bandwidth on full column reads on an m6i.xlarge
   (only 50 MB/s when reading columns though)

See dask/dask#10602
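For readers following along (this is not the code in the PR itself), a minimal Arrow-first reader in this spirit might look roughly like the sketch below. Function names are made up; metadata handling, partition fusing, and error handling are all omitted:

```python
import dask.dataframe as dd
import pyarrow.parquet as pq
from pyarrow.fs import FileSelector, FileSystem


def _read_file(path, fs, columns=None):
    # Read a single parquet file into pandas via Arrow.
    with fs.open_input_file(path) as f:
        return pq.read_table(f, columns=columns).to_pandas()


def simple_read_parquet(uri, columns=None):
    # Resolve e.g. "s3://bucket/prefix" to an Arrow filesystem and a root path.
    fs, root = FileSystem.from_uri(uri)
    files = sorted(
        info.path
        for info in fs.get_file_info(FileSelector(root, recursive=True))
        if info.path.endswith(".parquet")
    )
    # One partition per file; grouping small files into one partition is
    # exactly the part this sketch leaves out.
    return dd.from_map(_read_file, files, fs=fs, columns=columns)
```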
@mrocklin
Member Author

dask/dask-expr#373

mrocklin added a commit to mrocklin/dask-expr that referenced this issue Oct 30, 2023
@rjzamora
Member

Cool - Sorry, I saw dask-contrib/dask-expr#373 but I missed this issue.


I agree that we should simplify things in parquet land, and the best way to do this is to re-write things from scratch. The parquet-engine landscape is dramatically different from what it was when the Engine system was first introduced: the PyArrow API and feature set have completely changed since then, and RAPIDS is pretty adaptable on the parquet front anyway.

If you want dd.read_parquet to support hive-partitioned datasets and produce reliably “good” partition sizes, you will end up with a lot of the same bells and whistles that exist now, but you will have the opportunity to organize things in a way that makes sense for the current PyArrow API. Also, now that we have dask-Expr in the mix, the partition-size issue can probably be re-cast as an optimization pass instead of read_parquet arguments.

Somewhat related thoughts: I realize that you are not personally concerned about dask-cudf/rapids. I think that is okay if we can find a way to be smart about “dispatching” Expr classes in Dask-Expr (e.g. some variation or better alternative to dask/dask-expr#321). If downstream projects have some way to replace specific Expr definitions that don't suit them, then we can probably avoid the kinds of tech debt that the Engine design has caused in the past. One possible outcome is that we use Dask-Expr to start stripping out unnecessary backend/compute dispatching logic in favor of a lower-maintenance coarse-grained dispatching approach.

@mrocklin
Member Author

Yeah, I'm hoping that we can defer a lot of the fancy parquet logic to Arrow. ParquetDataset has a lot of stuff like hive partitioning, yes?

Mostly I want to avoid a situation where things get sufficiently complex that people stop maintaining them. This seems to be what happened with parquet. We all knew that it was a trainwreck, but no one wanted to spend the time to make it good.

If there is a plan that makes things easy for others to extend and that doesn't make life harder for folks maintaining the core then I have no objection.

@rjzamora
Member

rjzamora commented Oct 31, 2023

I'm hoping that we can defer a lot of the fancy parquet logic to Arrow. ParquetDataset has a lot of stuff like hive partitioning, yes?

Yes, I completely agree that we should defer as much as possible to PyArrow. The PyArrow feature set is probably ready to take the reins.
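For example (a sketch with made-up paths and column names), pyarrow.dataset already handles hive-style partition discovery and filter pushdown on its own:

```python
import pyarrow.dataset as ds

# Directory layout like .../year=2023/month=10/part-0.parquet
dataset = ds.dataset("s3://bucket/events/", format="parquet", partitioning="hive")

# Partition columns take part in the filter, so whole directories are pruned
# before any parquet file is opened.
table = dataset.to_table(
    columns=["user_id", "value", "year", "month"],
    filter=(ds.field("year") == 2023) & (ds.field("month") == 10),
)
```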

We all knew that it was a trainwreck, but no one wanted to spend the time to make it good.

Yes, it is a mess, but I don't think it is completely a question of time or effort. I'm pretty sure people have just been afraid of breaking the complicated PyArrow/FastParquet/Dask-cuDF/Dask-GeoPandas/etc API contract that was designed at a time when Dask needed to implement most of the ugly dataset parsing/planning logic. We were obviously trying to do something "good" by building a pluggable read_parquet architecture, but the original goal has proven impractical over time.

If there is a plan that makes things easy for others to extend and that doesn't make life harder for folks maintaining the core then I have no objection.

That's good to hear. I'm also pretty motivated to minimize maintenance burden.

Note: It may also be valuable to define simple Expr templates for custom IO Expressions (similar to the way from_map makes it relatively easy to implement custom IO solutions in dask.dataframe). It seems ideal to focus on a simple/performant API and to make it easy for power-users to implement their own custom logic when they need it.

@mrocklin
Member Author

Note: It may also be valuable to define simple Expr templates for custom IO Expressions (similar to the way from_map makes it relatively easy to implement custom IO solutions in dask.dataframe). It seems ideal to focus on a simple/performant API and to make it easy for power-users to implement their own custom logic when they need it

As a heads-up, I'll likely be far more nervous / hesitant around reworking things into a complex hierarchy. You should probably expect a lot more friction and demands for design docs ahead of time for anything like this. I'm likely going to push to keep things very flat and simple.

@rjzamora
Member

Hmm. That was not a suggestion for any kind of hierarchy at all, so not sure I understand. Either way, I hear you.

@mfenner1

mfenner1 commented Nov 9, 2023

I'd like to comment that if the current parquet code is going to be in "flux" and if the current system isn't really solid/maintained/maintainable, then maybe the advice/best practice of using Parquet (https://docs.dask.org/en/latest/dataframe-best-practices.html#use-parquet) should be (temporarily) removed until a more workable Parquet system is in place.

@fjetter
Member

fjetter commented Nov 10, 2023

" and if the current system isn't really solid/maintained/maintainable, then maybe the advice/best practice of using Parquet

Thanks for raising this concern. I believe the API as it is today is not in flux, and it is usable and maintained. It is difficult to maintain, and our development velocity is suffering from the complexity, which is why we're considering a rewrite.

We haven't made a final decision on how exactly to proceed with this, but we will make sure that the transition is as smooth as possible for users. The one recommendation I can give right now is to steer clear of the features that are already deprecated and raising warnings. They will certainly cease to exist.

Even with all the flaws in the current implementation, parquet is still far superior to other storage formats for tabular data, and this advice in the documentation is still very accurate. This issue should not scare you away from using parquet but rather encourage you to use it, because it will get even better down the road.

fjetter added the discussion and parquet labels and removed the needs triage label Dec 19, 2023
@mrocklin
Member Author

@fjetter I'm curious about plans here. As I think through docs and Dask messaging, there's a bunch of stuff I want to say about Dask and Spark and performance, and my guess is that we'll need to address this before we can say good things there with any strength. I know that this topic isn't top of mind for you right now, but I'd like to align on roughly when something like a performant parquet system might arrive. There's no immediate timing pressure, but I'd certainly like to align on this, say, early in 2024.

@fjetter
Member

fjetter commented Dec 20, 2023

In terms of prioritization, I consider getting dask-expr into main dask the most important thing right now. There is still a lot of uncertainty, so nobody is working on parquet right now. I hope that the dust settles towards the end of January so that we gain more confidence in dask-expr and can pick parquet up again. So, my best guess is the end of Q1.

@rjzamora
Member

Just a note that I have started experimenting with better ways to handle options like blocksize in the "future" read_parquet implementation. Florian has made the point several times that we shouldn't ever need to parse all the parquet metadata/statistics up-front, and I completely agree with him.
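As one illustration of that idea (a purely hypothetical helper, not the planned implementation), partition boundaries can be chosen from file sizes alone, which the filesystem already knows, without opening a single parquet footer:

```python
from pyarrow.fs import FileSelector, FileSystem


def plan_partitions(uri, blocksize=256 * 2**20):
    # Group files into ~blocksize partitions using filesystem metadata only,
    # so no parquet footers are read at graph-construction time.
    fs, root = FileSystem.from_uri(uri)
    infos = [
        info
        for info in fs.get_file_info(FileSelector(root, recursive=True))
        if info.path.endswith(".parquet")
    ]
    batches, current, current_size = [], [], 0
    for info in sorted(infos, key=lambda i: i.path):
        current.append(info.path)
        current_size += info.size
        if current_size >= blocksize:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)
    return batches  # each inner list becomes one read task / partition
```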

My assumption is that Coiled wants to take ownership of the parquet reboot. That would be welcome on my end, but I'm also happy to help and engage wherever it would be useful.

@mrocklin
Member Author

mrocklin commented Dec 20, 2023 via email

@phofl
Collaborator

phofl commented Jan 17, 2024

Probably we need to maintain a filesystem= keyword in case someone has an odd filesystem and needs to use fsspec, but this would be the exception I think

Just found out that arrow doesn't support Azure Blob Storage, so that's a good argument to keep the filesystem keyword.
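For example, something along these lines should keep working, assuming read_parquet keeps accepting an fsspec filesystem instance through filesystem= (account details are placeholders and adlfs needs to be installed):

```python
import adlfs
import dask.dataframe as dd

# Arrow's own filesystems don't cover Azure Blob Storage here, so we hand
# read_parquet an fsspec filesystem explicitly.
fs = adlfs.AzureBlobFileSystem(account_name="myaccount", account_key="...")
df = dd.read_parquet("abfs://container/dataset/", filesystem=fs)
```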

@fabiopicchi

I am using pandas to load a dataset, and fastparquet is more than twice as fast. I wonder if this is due to particularities of this dataset, but if not, I believe fastparquet still has a lot of value. The loading operation was executed 20 times for each engine:

engine        load time (s)
pyarrow       3.73 ± 0.28
fastparquet   1.33 ± 0.03
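For reference, a comparison along these lines can be reproduced with something like the following sketch (the path is a placeholder for the dataset in question):

```python
import statistics
import timeit

import pandas as pd

path = "dataset.parquet"  # placeholder

for engine in ("pyarrow", "fastparquet"):
    # Run the load 20 times per engine and report mean ± standard deviation.
    times = timeit.repeat(
        lambda: pd.read_parquet(path, engine=engine), repeat=20, number=1
    )
    print(f"{engine}: {statistics.mean(times):.2f} ± {statistics.stdev(times):.2f} s")
```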

@rjzamora
Member

rjzamora commented Dec 3, 2024

I believe fastparquet still has a lot of value

Fastparquet certainly does have a lot of value. However, it has been challenging to maintain multiple backends via dd.read_parquet in the past. Given that "modern" Pandas relies pretty heavily on PyArrow at this point, it just makes sense to focus on that particular backend for now.

With that said, it should be pretty easy to use fastparquet with dd.from_map if there are performance advantages for your particular workflow.
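A minimal sketch of that escape hatch, assuming fastparquet is installed (file names and columns are placeholders):

```python
import dask.dataframe as dd
import fastparquet


def read_with_fastparquet(path, columns=None):
    # One fastparquet read per file -> one Dask partition per file.
    return fastparquet.ParquetFile(path).to_pandas(columns=columns)


files = ["data/part.0.parquet", "data/part.1.parquet"]
df = dd.from_map(read_with_fastparquet, files, columns=["a", "b"])
```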
