
Parquet Reboot #10602

Open
mrocklin opened this issue Oct 26, 2023 · 20 comments
Labels
discussion (Discussing a topic with no specific actions yet), parquet

Comments

@mrocklin
Member

Our parquet performance is bad. I get 20 MB/s in real-world use cases on the cloud where I would expect 500 MB/s. This accounts for ~80% of our runtime in complex dataframe queries in TPC-H. Systems like P2P, the GIL, Pandas copy-on-write, PyArrow strings, etc. are all inconsequential relative to this performance bottleneck.

Unfortunately, improving this situation is difficult because our current Parquet implementation has many layers and many systems all interwoven with each other. What should be a relatively simple system is today somewhat opaque. I would like for us to consider a rewrite.

There are many things to consider here. I'll follow up with some personal thoughts, and I welcome thoughts from others.

github-actions bot added the needs triage label Oct 26, 2023
@mrocklin
Member Author

mrocklin commented Oct 26, 2023

Personal recommendation

If I were to build this I would remove a lot. I'd remove engines and go Arrow-first for both parquet deserialization and data access. I think we could make this a single-Python-file effort with reliably good performance and much easier maintenance.

We would lose some features:

  • fastparquet
  • index / divisions
  • metadata file
  • We'd probably need to keep a filesystem= keyword in case someone has an odd filesystem and needs to use fsspec, but I think that would be the exception
  • Losing engines probably means that RAPIDS folks need to maintain their own read-parquet code paths outside of Dask. I think that's ok; they've done that before with dask-cudf, and they'd probably just do it again.

Some things to gain

@phofl
Collaborator

phofl commented Oct 26, 2023

Having a smaller environment because you don't have to install arrow is the main benefit of fastparquet, as far as I am aware. Pandas will require arrow as a dependency starting with the 3.0 release if there aren't any unexpected developments, which would render the environment-size argument irrelevant.

@mrocklin
Member Author

There are likely other benefits to fastparquet, but mostly I just want one system rather than having to support many. While I like fastparquet, I like pyarrow marginally more.

@fjetter
Member

fjetter commented Oct 27, 2023

+1

I think specializing on pyarrow will be the biggest benefit here; it's what allows us to cut complexity.

Additionally, I would like us to review closely how we want to deal with files and rowgroups and the collection of statistics.

I'm very strongly biased against collecting statistics for the entire dataset automatically (if ever), but I appreciate that there are use cases where we have to deal with parquet metadata. I would like us to revisit the existing logic and invest time in reproducers that cover the features we are offering and allow us to run tests for them.

Re: the single-file effort. I believe this is something that we also have to review a little more closely. Especially with the introduction of FusedIO in dask-expr, there are significant optimizations available when treating multiple files as a batch (pre-fetching IO, for instance). pyarrow already implements these capabilities transparently to a certain degree.
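For context, a rough sketch of the readahead knobs pyarrow exposes for this (parameter names as found in recent pyarrow.dataset releases; the path and columns are placeholders):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("s3://bucket/tpch/lineitem/", format="parquet")

# The scanner can prefetch record batches within a file as well as whole
# fragments (files), which covers much of the batched / pre-fetched IO that
# FusedIO would otherwise have to orchestrate on the Dask side.
scanner = dataset.scanner(
    columns=["l_orderkey", "l_extendedprice"],
    batch_readahead=16,    # batches prefetched within a fragment
    fragment_readahead=4,  # fragments (files) prefetched concurrently
)
table = scanner.to_table()
```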

@mrocklin
Member Author

I'm going to take some time this morning and prototype a very simple implementation. If it works I'll use it for TPC-H. We'll see how far I get. I give it a 50/50 chance that I give up because it's too complex.

mrocklin added a commit to mrocklin/dask-expr that referenced this issue Oct 27, 2023
This is a dumb, mostly-from-scratch implementation of read_parquet.

It only supports
-  local and s3
-  column selection
-  grouping partitions when we have fewer columns (+ threads!)
-  arrow engine/filesystem

It is very broken in many ways, but ...

-  It's only around 100 lines of code
-  I get 250 MB/s bandwidth on full column reads on an m6i.xlarge
   (only 50 MB/s when reading columns though)

See dask/dask#10602
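For readers following along (this is not the code in the PR itself), a minimal Arrow-first reader in this spirit might look roughly like the sketch below. Function names are made up; metadata handling, partition fusing, and error handling are all omitted:

```python
import dask.dataframe as dd
import pyarrow.parquet as pq
from pyarrow.fs import FileSelector, FileSystem


def _read_file(path, fs, columns=None):
    # Read a single parquet file into pandas via Arrow.
    with fs.open_input_file(path) as f:
        return pq.read_table(f, columns=columns).to_pandas()


def simple_read_parquet(uri, columns=None):
    # Resolve e.g. "s3://bucket/prefix" to an Arrow filesystem and a root path.
    fs, root = FileSystem.from_uri(uri)
    files = sorted(
        info.path
        for info in fs.get_file_info(FileSelector(root, recursive=True))
        if info.path.endswith(".parquet")
    )
    # One partition per file; grouping small files into one partition is
    # exactly the part this sketch leaves out.
    return dd.from_map(_read_file, files, fs=fs, columns=columns)
```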
@mrocklin
Member Author

dask/dask-expr#373

mrocklin added a commit to mrocklin/dask-expr that referenced this issue Oct 30, 2023
@rjzamora
Member

Cool - Sorry, I saw dask-contrib/dask-expr#373 but I missed this issue.


I agree that we should simplify things in parquet land, and the best way to do this is to re-write things from scratch. The parquet-engine landscape is dramatically different from what it was when the Engine system was first introduced: the PyArrow API and feature set have completely changed since then, and RAPIDS is pretty adaptable on the parquet front anyway.

If you want dd.read_parquet to support hive-partitioned datasets and produce reliably “good” partition sizes, you will end up with a lot of the same bells and whistles that exist now, but you will have the opportunity to organize things in a way that makes sense for the current PyArrow API. Also, now that we have dask-Expr in the mix, the partition-size issue can probably be re-cast as an optimization pass instead of read_parquet arguments.

Somewhat related thoughts: I realize that you are not personally concerned about dask-cudf/rapids. I think that is okay if we can find a way to be smart about “dispatching” Expr classes in Dask-Expr (e.g. some variation or better alternative to dask/dask-expr#321). If downstream projects have some way to replace specific Expr definitions that don't suit them, then we can probably avoid the kinds of tech debt that the Engine design has caused in the past. One possible outcome is that we use Dask-Expr to start stripping out unnecessary backend/compute dispatching logic in favor of a lower-maintenance coarse-grained dispatching approach.

@mrocklin
Member Author

Yeah, I'm hoping that we can defer a lot of the fancy parquet logic to Arrow. ParquetDataset has a lot of stuff like hive partitioning, yes?

Mostly I want to avoid a situation where things get sufficiently complex that people stop maintaining them. This seems to be what happened with parquet. We all knew that it was a trainwreck, but no one wanted to spend the time to make it good.

If there is a plan that makes things easy for others to extend and that doesn't make life harder for folks maintaining the core then I have no objection.

@rjzamora
Member

rjzamora commented Oct 31, 2023

I'm hoping that we can defer a lot of the fancy parquet logic to Arrow. ParquetDataset has a lot of stuff like hive partitioning, yes?

Yes, I completely agree that we should defer as much as possible to PyArrow. The PyArrow feature set is probably ready to take the reins.
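For example (a sketch with made-up paths and column names), pyarrow.dataset already handles hive-style partition discovery and filter pushdown on its own:

```python
import pyarrow.dataset as ds

# Directory layout like .../year=2023/month=10/part-0.parquet
dataset = ds.dataset("s3://bucket/events/", format="parquet", partitioning="hive")

# Partition columns take part in the filter, so whole directories are pruned
# before any parquet file is opened.
table = dataset.to_table(
    columns=["user_id", "value", "year", "month"],
    filter=(ds.field("year") == 2023) & (ds.field("month") == 10),
)
```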

We all knew that it was a trainwreck, but no one wanted to spend the time to make it good.

Yes, it is a mess, but I don't think it is completely a question of time or effort. I'm pretty sure people have just been afraid of breaking the complicated PyArrow/FastParquet/Dask-cuDF/Dask-GeoPandas/etc API contract that was designed at a time when Dask needed to implement most of the ugly dataset parsing/planning logic. We were obviously trying to do something "good" by building a pluggable read_parquet architecture, but the original goal has proven impractical over time.

If there is a plan that makes things easy for others to extend and that doesn't make life harder for folks maintaining the core then I have no objection.

That's good to hear. I'm also pretty motivated to minimize maintenance burden.

Note: It may also be valuable to define simple Expr templates for custom IO Expressions (similar to the way from_map makes it relatively easy to implement custom IO solutions in dask.dataframe). It seems ideal to focus on a simple/performant API and to make it easy for power-users to implement their own custom logic when they need it.

@mrocklin
Member Author

Note: It may also be valuable to define simple Expr templates for custom IO Expressions (similar to the way from_map makes it relatively easy to implement custom IO solutions in dask.dataframe). It seems ideal to focus on a simple/performant API and to make it easy for power-users to implement their own custom logic when they need it

As a heads-up, I'll likely be far more nervous / hesitant around reworking things into a complex hierarchy. You should probably expect a lot more friction and demands for design docs ahead of time for anything like this. I'm likely going to push to keep things very flat and simple.

@rjzamora
Member

Hmm. That was not a suggestion for any kind of hierarchy at all, so not sure I understand. Either way, I hear you.

@mfenner1

mfenner1 commented Nov 9, 2023

I'd like to comment that if the current parquet code is going to be in "flux" and if the current system isn't really solid/maintained/maintainable, then maybe the advice/best practice of using Parquet (https://docs.dask.org/en/latest/dataframe-best-practices.html#use-parquet) should be (temporarily) removed until a more workable Parquet system is in place.

@fjetter
Member

fjetter commented Nov 10, 2023

" and if the current system isn't really solid/maintained/maintainable, then maybe the advice/best practice of using Parquet

Thanks for raising this concern. I believe the API as it is today is not in flux, and it is usable and maintained. It is difficult to maintain, and our development velocity is suffering from the complexity, which is why we're considering a rewrite.

We haven't made a final decision on how exactly to proceed with this, but we will make sure that the transition is as smooth as possible for users. The one recommendation I can give right now is to steer clear of the features that are already deprecated and raising warnings. They will certainly cease to exist.

Even with all the flaws in the current implementation, parquet is still far superior to other storage formats for tabular data, and this advice in the documentation is still very accurate. This issue should not scare you away from using parquet but rather encourage you to use it, because it will get even better down the road.

fjetter added the discussion and parquet labels and removed the needs triage label Dec 19, 2023
@mrocklin
Member Author

@fjetter I'm curious about plans here. As I think through docs and Dask messaging, there's a bunch of stuff I want to say about Dask and Spark and performance, and my guess is that we'll need to address this before we can say good things there with any strength. I know that this topic isn't top of mind for you right now, but I'd like to align on roughly when something like a performant parquet system might arrive. There's no immediate timing pressure, but I'd certainly like to align on this, say, early in 2024.

@fjetter
Member

fjetter commented Dec 20, 2023

In terms of prioritization, I consider getting dask-expr into main dask the most important thing right now. There is still a lot of uncertainty, so nobody is working on parquet right now. I hope that the dust settles towards the end of January so that we gain more confidence in dask-expr and can pick parquet up again. So, my best guess is the end of Q1.

@rjzamora
Member

Just a note that I have started experimenting with better ways to handle options like blocksize in the "future" read_parquet implementation. Florian has made the point several times that we shouldn't ever need to parse all the parquet metadata/statistics up-front, and I completely agree with him.
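As one illustration of that idea (a purely hypothetical helper, not the planned implementation), partition boundaries can be chosen from file sizes alone, which the filesystem already knows, without opening a single parquet footer:

```python
from pyarrow.fs import FileSelector, FileSystem


def plan_partitions(uri, blocksize=256 * 2**20):
    # Group files into ~blocksize partitions using filesystem metadata only,
    # so no parquet footers are read at graph-construction time.
    fs, root = FileSystem.from_uri(uri)
    infos = [
        info
        for info in fs.get_file_info(FileSelector(root, recursive=True))
        if info.path.endswith(".parquet")
    ]
    batches, current, current_size = [], [], 0
    for info in sorted(infos, key=lambda i: i.path):
        current.append(info.path)
        current_size += info.size
        if current_size >= blocksize:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)
    return batches  # each inner list becomes one read task / partition
```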

My assumption is that Coiled wants to take ownership of the parquet reboot. That would be welcome on my end, but I'm also happy to help and engage wherever it would be useful.

@mrocklin
Member Author

mrocklin commented Dec 20, 2023 via email

@phofl
Collaborator

phofl commented Jan 17, 2024

Probably we need to maintain a filesystem= keyword in case someone has an odd filesystem and needs to use fsspec, but this would be the exception I think

Just found out that arrow doesn't support Azure Blob Storage, so that's a good argument to keep the filesystem keyword.
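For example, something along these lines should keep working, assuming read_parquet keeps accepting an fsspec filesystem instance through filesystem= (account details are placeholders and adlfs needs to be installed):

```python
import adlfs
import dask.dataframe as dd

# Arrow's own filesystems don't cover Azure Blob Storage here, so we hand
# read_parquet an fsspec filesystem explicitly.
fs = adlfs.AzureBlobFileSystem(account_name="myaccount", account_key="...")
df = dd.read_parquet("abfs://container/dataset/", filesystem=fs)
```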

@fabiopicchi

I am using pandas to load a dataset, and fastparquet is more than twice as fast. I wonder if this is due to particularities of this dataset, but if not, I believe fastparquet still has a lot of value. The loading operation was executed 20 times for each engine:

engine        load time (s)
pyarrow       3.73 ± 0.28
fastparquet   1.33 ± 0.03
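For reference, a comparison along these lines can be reproduced with something like the following sketch (the path is a placeholder for the dataset in question):

```python
import statistics
import timeit

import pandas as pd

path = "dataset.parquet"  # placeholder

for engine in ("pyarrow", "fastparquet"):
    # Run the load 20 times per engine and report mean ± standard deviation.
    times = timeit.repeat(
        lambda: pd.read_parquet(path, engine=engine), repeat=20, number=1
    )
    print(f"{engine}: {statistics.mean(times):.2f} ± {statistics.stdev(times):.2f} s")
```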

@rjzamora
Member

rjzamora commented Dec 3, 2024

I believe fastparquet still has a lot of value

Fastparquet certainly does have a lot of value. However, it has been challenging to maintain multiple backends via dd.read_parquet in the past. Given that "modern" Pandas relies pretty heavily on PyArrow at this point, it just makes sense to focus on that particular backend for now.

With that said, it should be pretty easy to use fastparquet with dd.from_map if there are performance advantages for your particular workflow.
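A minimal sketch of that escape hatch, assuming fastparquet is installed (file names and columns are placeholders):

```python
import dask.dataframe as dd
import fastparquet


def read_with_fastparquet(path, columns=None):
    # One fastparquet read per file -> one Dask partition per file.
    return fastparquet.ParquetFile(path).to_pandas(columns=columns)


files = ["data/part.0.parquet", "data/part.1.parquet"]
df = dd.from_map(read_with_fastparquet, files, columns=["a", "b"])
```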
