Parquet Reboot #10602
Personal recommendation: If I were to build this, I would remove a lot. I'd remove engines and go Arrow-first, with Arrow handling both parquet deserialization and data access. I think we could make this a single-python-file effort with reliably good performance and much easier maintenance. We would lose some features:
Some things to gain:
Having a smaller environment because you don't have to install arrow is the main benefit of fastparquet, as far as I am aware. Pandas will require arrow as a dependency starting with the 3.0 release if there aren't any unexpected developments, which would render the environment-size argument irrelevant.
There are likely other benefits to fastparquet, but mostly I just want one system rather than having to support many, and while I like fastparquet, I like pyarrow marginally more.
+1, I think specializing on pyarrow will be the biggest benefit here and will allow us to cut complexity. Additionally, I would like us to review closely how we want to deal with files, row groups, and the collection of statistics. I'm very strongly biased against collecting statistics for the entire dataset automatically (if ever), but I appreciate that there are use cases where we have to deal with parquet metadata. I would like us to revisit the existing logic and invest time in reproducers that cover the features we are offering and allow us to run tests for them. Re: single-file effort: I believe this is something that we also have to review a little more closely, especially with the introduction of
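For context on what on-demand, per-file statistics access can look like, here is a small hedged sketch using pyarrow's metadata API. This is not existing Dask code, and the file name is a placeholder; it only illustrates that a single footer can be inspected without touching the rest of the dataset.

```python
import pyarrow.parquet as pq

# Read only the footer of one file (hypothetical path); no other files
# in the dataset are opened.
md = pq.read_metadata("part.0.parquet")

for rg in range(md.num_row_groups):
    col = md.row_group(rg).column(0)   # first column chunk in this row group
    stats = col.statistics             # may be None if stats weren't written
    if stats is not None and stats.has_min_max:
        print(rg, col.path_in_schema, stats.min, stats.max)
```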
I'm going to take some time this morning and prototype a very simple implementation. If it works I'll use it for TPC-H. We'll see how far I get. I give it a 50/50 chance that I give up because it's too complex.
This is a dumb, mostly-from-scratch implementation of read_parquet. It only supports:
- local and s3
- column selection
- grouping partitions when we have fewer columns (+ threads!)
- arrow engine/filesystem

It is very broken in many ways, but ...
- It's only around 100 lines of code
- I get 250 MB/s bandwidth on full column reads on an m6i.xlarge (only 50 MB/s when reading columns though)

See dask/dask#10602
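For readers following along, a rough sketch (not the actual prototype above) of what an Arrow-first read_parquet can look like: list files with a PyArrow filesystem, read each file into pandas via Arrow, and build the collection with dask.dataframe.from_map. The names `simple_read_parquet` and `_read_one` are illustrative, and this skips the partition-grouping and threading the prototype mentions.

```python
import pyarrow.fs as pafs
import pyarrow.parquet as pq
import dask.dataframe as dd


def _read_one(path, fs, columns):
    # Read a single parquet file into a pandas DataFrame via Arrow.
    with fs.open_input_file(path) as f:
        table = pq.read_table(f, columns=columns)
    return table.to_pandas()


def simple_read_parquet(path, columns=None):
    # FileSystem.from_uri handles both absolute local paths and s3:// URIs.
    fs, root = pafs.FileSystem.from_uri(path)
    paths = [
        info.path
        for info in fs.get_file_info(pafs.FileSelector(root, recursive=True))
        if info.is_file and info.path.endswith(".parquet")
    ]
    # One partition per file; a fuller implementation would group small files
    # (or files read with few columns) into larger partitions.
    return dd.from_map(_read_one, paths, fs=fs, columns=columns)


# df = simple_read_parquet("s3://my-bucket/dataset/", columns=["a", "b"])
```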
Cool - sorry, I saw dask-contrib/dask-expr#373 but I missed this issue. I agree that we should simplify things in parquet land, and the best way to do this is to rewrite things from scratch. The parquet-engine landscape is dramatically different than it was when the current system was designed. Somewhat related thoughts: I realize that you are not personally concerned about dask-cudf/rapids - I think that is okay if we can find a way to be smart about “dispatching”.
Yeah, I'm hoping that we can defer a lot of the fancy parquet logic to Arrow. ParquetDataset has a lot of stuff like hive partitioning, yes? Mostly I want to avoid a situation where things get sufficiently complex that people stop maintaining them. This seems to be what happened with parquet: we all knew that it was a trainwreck, but no one wanted to spend the time to make it good. If there is a plan that makes things easy for others to extend and that doesn't make life harder for folks maintaining the core, then I have no objection.
Yes, I completely agree that we should defer as much as possible to PyArrow. The PyArrow feature set is probably ready to take the reins.
Yes, it is a mess, but I don't think it is completely a question of time or effort. I'm pretty sure people have just been afraid of breaking the complicated PyArrow/FastParquet/Dask-cuDF/Dask-GeoPandas/etc API contract that was designed at a time when Dask needed to implement most of the ugly dataset parsing/planning logic. We were obviously trying to do something "good" by building a pluggable engine system.
That's good to hear. I'm also pretty motivated to minimize maintenance burden. Note: It may also be valuable to define simple
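As a concrete illustration of how much of the "fancy parquet logic" PyArrow already covers, here is a hedged sketch of the pyarrow.dataset API handling directory discovery, hive-style partitioning, column projection, and filter pushdown. The bucket, column names, and filter value are placeholders, not anything from this thread.

```python
import pyarrow.dataset as ds

dataset = ds.dataset(
    "s3://my-bucket/events/",   # hypothetical location
    format="parquet",
    partitioning="hive",        # parses key=value directory names into columns
)

# Projection and predicate pushdown are handled by Arrow, not by Dask.
table = dataset.to_table(
    columns=["user_id", "value"],
    filter=ds.field("date") == "2023-12-01",
)
```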
As a heads-up, I'll likely be far more nervous / hesitant around reworking things into a complex hierarchy. You should probably expect a lot more friction and demands for design docs ahead of time for anything like this. I'm likely going to push to keep things very flat and simple.
Hmm. That was not a suggestion for any kind of hierarchy at all, so not sure I understand. Either way, I hear you.
I'd like to comment that if the current parquet code is going to be in "flux" and if the current system isn't really solid/maintained/maintainable, then maybe the advice/best practice of using Parquet (https://docs.dask.org/en/latest/dataframe-best-practices.html#use-parquet) should be (temporarily) removed until a more workable Parquet system is in place.
Thanks for raising this concern. I believe the API as-is is not in flux; it is usable and maintained. It is difficult to maintain, and our development velocity is suffering from the complexity, so we're considering a rewrite. We haven't made a final decision on how exactly to proceed, but we will make sure that the transition is as smooth as possible for users. The one recommendation I can give right now is to stay clear of the already-deprecated features that are raising warnings; they will certainly cease to exist. Even with all its flaws in the current implementation, parquet is still far superior to other storage formats for tabular data, and this advice in the documentation is still very accurate.
@fjetter I'm curious about plans here. As I think through docs and Dask messaging, there's a bunch of stuff I want to say about Dask, Spark, and performance, and my guess is that we'll need to address this before we can say good things there with any strength. I know that this topic isn't top of mind for you right now, but I'd like to align on roughly when something like a performant parquet system might arrive. No immediate timing pressure on this, but I'd certainly like to align on this, say, early in 2024.
In terms of prioritization, I consider getting dask-expr into main dask the most important thing right now. There is still a lot of uncertainty, so nobody is working on parquet at the moment. I hope that the dust settles towards the end of January, such that we gain more confidence about dask-expr and can pick parquet up again. So, my best guess is the end of Q1.
Just a note that I have started experimenting (https://gist.github.com/rjzamora/2c55d7e1c614b7df8f27dd92205a43fa) with better ways to handle options like blocksize in the "future" read_parquet implementation. Florian has made the point several times that we shouldn't ever need to parse all the parquet metadata/statistics up-front, and I completely agree with him. My assumption is that Coiled wants to take ownership of the parquet reboot. That would be welcome on my end, but I'm also happy to help and engage wherever it would be useful.
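One possible shape for blocksize handling that avoids opening any parquet footers (not necessarily what the linked gist does) is to group files into partitions using only the file sizes reported by the filesystem. A hedged sketch, with the function name and default blocksize chosen purely for illustration:

```python
import pyarrow.fs as pafs


def group_files_by_blocksize(uri, blocksize=128 * 2**20):
    """Group parquet files into partitions of roughly `blocksize` bytes,
    using filesystem metadata only (no footer parsing)."""
    fs, root = pafs.FileSystem.from_uri(uri)
    infos = [
        info
        for info in fs.get_file_info(pafs.FileSelector(root, recursive=True))
        if info.is_file and info.path.endswith(".parquet")
    ]
    partitions, current, current_size = [], [], 0
    for info in infos:
        current.append(info.path)
        current_size += info.size
        if current_size >= blocksize:
            partitions.append(current)
            current, current_size = [], 0
    if current:
        partitions.append(current)
    return partitions  # each inner list would become one Dask partition
```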
To be clear, anyone is welcome to propose changes / reinventions of Parquet. We'll eventually take responsibility here assuming no one else does (which is our current assumption, to be safe).
Just found out that arrow doesn't support azure blob storage, so that's a good argument to keep the filesystem keyword.
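Until Arrow grows a native Azure filesystem, the fsspec/adlfs route is the one that works with the existing read_parquet interface. A hedged sketch of that path; the container, account name, and credentials below are placeholders:

```python
import dask.dataframe as dd

df = dd.read_parquet(
    "abfs://my-container/dataset/",   # adlfs handles the abfs:// scheme via fsspec
    storage_options={
        "account_name": "myaccount",  # placeholder credentials
        "account_key": "...",
    },
)
```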
I am using pandas to load a dataset, and fastparquet is more than twice as fast. I wonder if this is due to particularities of this dataset, but if not, I believe fastparquet still has a lot of value. The loading operation was executed 20 times for each engine:
(timings are in seconds)
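For reference, a minimal sketch of the kind of comparison described above; the file path and repeat count are placeholders, and the original poster's timing numbers are not reproduced here.

```python
import time
import pandas as pd


def time_engine(engine, path="data.parquet", repeats=20):
    # Time repeated full reads of the same file with a given engine.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        pd.read_parquet(path, engine=engine)
        timings.append(time.perf_counter() - start)
    return min(timings), sum(timings) / len(timings)


for engine in ("pyarrow", "fastparquet"):
    best, mean = time_engine(engine)
    print(f"{engine}: best={best:.2f}s mean={mean:.2f}s")
```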
Fastparquet certainly does have a lot of value. However, it has been challenging to maintain multiple backends via the engine system. With that said, it should be pretty easy to use
Our parquet performance is bad. I get 20 MB/s in real-world use cases on the cloud where I would expect 500 MB/s. This accounts for ~80% of our runtime in complex dataframe queries in TPC-H. Systems like P2P, the GIL, Pandas copy-on-write, PyArrow strings, etc., are all inconsequential relative to this performance bottleneck.
Unfortunately, improving this situation is difficult because our current Parquet implementation has many layers and many systems all interwoven with each other. What should be a relatively simple system is today somewhat opaque. I would like for us to consider a rewrite.
There are many things to consider here. I'll follow up with some personal thoughts, and I welcome thoughts from others.