
Add IO fusion if we can reduce number of partitions #327

Merged: 8 commits into dask:main on Oct 11, 2023
Conversation

@phofl (Collaborator) commented Oct 11, 2023

We end up with a lot of small chunks when we can drop most of the columns from the Parquet files. This is especially bad for P2P-based merges, since small chunks slow us down there. Squashing the partitions together solves that problem, so that we end up with the original chunk size.
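To make the intent concrete, here is a minimal sketch of the arithmetic, with illustrative names that are not the PR's actual API: if a column projection keeps only a fraction of the data, enough input partitions are fused together to restore the original partition size.

import math

def fused_npartitions(npartitions, compression_factor):
    # How many input partitions to pack into one output partition:
    # a factor of 0.2 (2 of 10 columns kept) packs 5 together.
    per_partition = max(int(1 / compression_factor), 1)
    return math.ceil(npartitions / per_partition)

assert fused_npartitions(100, 0.2) == 20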

@mrocklin (Member)

This structure of nesting one IO expression inside another seems fine, but also a little complicated.

In an ideal world I think that the IO expressions, like ReadParquet, would be able to group multiple partitions together, and that we would use those parameters instead of FusedIO. Thoughts? Maybe this could be achieved with a convention among IO expressions?

@phofl (Collaborator, Author) commented Oct 11, 2023

Having this abstraction made the generic implementation easier, since we don't have to worry about how the individual partitions are created in the first place. I generally agree with you, but I think it will be easier to figure this out once we have more I/O connectors.

Comment on lines 55 to 56
def _factor(self):
    return 1
Member:

This could probably use a better name, or an explanatory docstring.

Collaborator Author:

Yeah, good point.

if self.operand("columns") is None:
    return 1
nr_original_columns = len(self._dataset_info["schema"].names) - 1
return len(_convert_to_list(self.operand("columns"))) / nr_original_columns
Member:

Maybe this shouldn't be a property or an instance method, but should instead be some logic that we do within the fusion operation. It seems to be only relevant for that optimization. Is that correct?

Collaborator Author:

Yes, but the factor calculation will differ between I/O methods, so it's tricky to come up with a general mechanism that won't need a bunch of special-casing.

E.g., CSV will need different logic than Parquet does here.
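For concreteness, here is roughly what the Parquet computation above yields, with illustrative numbers (assuming the - 1 in the snippet excludes an index column from the schema):

schema_names = ["idx", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
nr_original_columns = len(schema_names) - 1   # 10 data columns
selected = ["a", "b"]
factor = len(selected) / nr_original_columns  # 0.2 -> fuse ~5 partitions into 1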

@mrocklin (Member)

> Having this abstraction made the generic implementation easier, since we don't have to worry about how the individual partitions are created in the first place. I generally agree with you, but I think it will be easier to figure this out once we have more I/O connectors.

Maybe if we're thinking short term, then we just do this on ReadParquet and ReadCSV in a less general-purpose way? This doesn't even need to be an optimization; it could instead be default behavior on the class itself.

I don't fundamentally disagree with the approach taken here, but it seems like using a big gun to solve a specific problem. I suspect that a less sophisticated solution would be as effective with what we have today. If we find that it's hard to do this on each IO class consistently, then I think it would make a lot of sense to do something generic like this.

In practice I think it's not a big deal today, but I'm imagining the developer of next year who is presented with this FusedIO thing and then needs to figure out why it exists.

@mrocklin (Member)

Not a big deal. I'm happy to be overruled on this. It just feels a little off to me.

@phofl (Collaborator, Author) commented Oct 11, 2023

There is another issue: the fusion we are doing here needs to happen after combine_similar, but before blockwise fusion. We basically need another optimization step, and I don't think we can encapsulate this in the ReadParquet class; we would have to subclass it to override the divisions calculation. I think this is the cleanest solution, but I might be very wrong as well.

return expr

expr = _fusion_pass(expr)
return expr
Member:

Maybe this could be implemented with the following:

class BlockwiseIO:
    def simplify_down(self):
        if self._fusion_compression_factor < 1:
            return FusedIO(self)

I might be missing something here though.

Maybe this is part of my confusion here. This seems like an optimization that we always want to make on an expression type. I think that means that we don't really need to use the Expression stuff at all. This could be a method on IO.

Sometimes people get so used to using a fancy form of optimization that they forget about the simpler approaches they have.

Collaborator Author:

No, sorry, that won't work in simplify (see below); it has to come after combine_similar. Happy to chat if my explanation is confusing.

@mrocklin (Member)

> There is another issue: the fusion we are doing here needs to happen after combine_similar, but before blockwise fusion. We basically need another optimization step, and I don't think we can encapsulate this in the ReadParquet class; we would have to subclass it to override the divisions calculation. I think this is the cleanest solution, but I might be very wrong as well.

I'm not sure I understand. So let's say that after various optimizations, we have the following expression:

ReadParquet(path=..., columns=["a", "b"])

Say the dataset has ten columns. While normally this expression would report that it has 100 partitions, it now reports that it has 20, and every task that it emits reads five row groups.

@phofl (Collaborator, Author) commented Oct 11, 2023

After simplify but before combine_similar, we might have several different operations that read from the same Parquet file:

ReadParquet(path=x, columns=["a", "b"])
ReadParquet(path=x, columns=["c", "b"])
ReadParquet(path=x, columns=["a"])
ReadParquet(path=x, columns=["a", "b", "c"])

combine_similar squashes all of them together, so that every IO operation reading from "x" becomes:

ReadParquet(path=x, columns=["a", "b", "c"])

We cannot determine the compression factor before we remove repeated reads from storage and take the union of the columns we need, so this fusion step has to come after combine_similar.

I am open to adding a step similar to simplify, but we can't combine the two.
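A small illustration of why the ordering matters, using the columns from the example above: computing the factor from any single branch before combine_similar runs would be misleading.

factor_before = 1 / 3  # branch reading only ["a"] out of ["a", "b", "c"]
factor_after = 3 / 3   # the combined read needs all three columns -> no fusion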

@mrocklin (Member)

Sorry, maybe I'm not being clear with my recent suggestion. It's that there is no optimization step, but rather that an expression like ReadParquet(path=x, columns=["a", "b", "c"]) knows that it should generate tasks that read from several row groups. So there's no optimization or Expr manipulation necessary; it's just that the default number of row groups per partition in ReadParquet changes based on the selected columns.

Again though, I'm also happy if you just override me here. I don't think that what I'm suggesting is strictly better or strictly simpler. It moves the complexity into the expression, rather than making a new nested expression. It's a less general solution, but also a lower-tech solution. Not better, just different. If you just want to move forward with your plan then please do.
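For reference, a minimal sketch of this alternative, assuming illustrative names rather than the PR's code: ReadParquet itself derives its partition count from the column selection, so no FusedIO wrapper is needed.

import math

class ReadParquetSketch:
    # Hypothetical stand-in for ReadParquet with the behavior described above.
    def __init__(self, n_row_groups, n_total_columns, columns=None):
        self.n_row_groups = n_row_groups
        self.n_total_columns = n_total_columns
        self.columns = columns

    @property
    def row_groups_per_partition(self):
        # Read more row groups per task when fewer columns are selected.
        if self.columns is None:
            return 1
        factor = len(self.columns) / self.n_total_columns
        return max(int(1 / factor), 1)

    @property
    def npartitions(self):
        return math.ceil(self.n_row_groups / self.row_groups_per_partition)

# 100 row groups, 10 columns, 2 selected -> 20 partitions of 5 row groups each
assert ReadParquetSketch(100, 10, columns=["a", "b"]).npartitions == 20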

@phofl (Collaborator, Author) commented Oct 11, 2023

I tried this approach in the beginning as well, but I couldn't figure out a way of calculating the divisions appropriately, because we have to use the initial divisions until combine_similar has happened, and only then can we switch to the version that reads multiple parts. This got pretty complicated.

@mrocklin (Member)

I'm curious to learn more about that. I would hope that during optimization we don't really need to query divisions at all.

Happy to chat any time. I'll be in meetings for the next couple of hours though.

If you want to go ahead and merge this and move forward, that's fine with me. I get a little nervous as things build on top of it, though.

@phofl (Collaborator, Author) commented Oct 11, 2023

Blockwise operations validate divisions, for example, so we can't get away without them.

@rjzamora (Member)

> Blockwise operations validate divisions, for example, so we can't get away without them.

Yeah, I think some of the pain indeed comes down to the fact that we rely so much on divisions/npartitions in many Expr classes (which I was also thinking about yesterday). I think optimizations like this would be much easier if we tried producing "abstract" classes from the collection API whenever possible, and avoided "lowering" to expressions with divisions/npartitions support until it becomes necessary. However, this is a non-trivial change at this point.

That said, I'm not really sure we need to introduce a new Expr class here. It seems like it should be possible to replace a ReadParquet instance with another ReadParquet instance that has a larger blocksize argument?

@phofl (Collaborator, Author) commented Oct 11, 2023

I don't really know how all the parquet operations work internally, so you are probably better suited to judge this.

The issue I am seeing here is that I don't really know what the block size is. I want to make sure that I end up with the initial block size, e.g. combining two partitions into one if we drop half the columns. I am open to suggestions if we can do something similar with the current read_parquet arguments (that said, I would like to avoid inspecting S3 again for meta and other attributes; the cost is non-trivial).

Edit: This obviously wouldn't work for other IO operations, so it would only be a partial solution.

@rjzamora (Member)

> (that said, I would like to avoid inspecting S3 again for meta and other attributes; the cost is non-trivial)

Yeah, that's the part I'm worried about as well. I know we don't need to collect new metadata, but I'm not sure if something will need to change to avoid this. I do agree with your interest in adding a general Expr-based solution here, but I'd like to take a bit of time to think of alternatives.

Side note: The most general solution is probably to avoid collecting divisions at all in top-level IO expressions like ReadParquet, and to only do this after a simplify/combine_similar pass, when a "lower" operation could convert the abstract expression to something like ReadParquetPartitions (at which point the column projection could be taken into account when choosing the file/row-group-to-partition mapping).

@@ -121,6 +126,37 @@ def _combine_similar(self, root: Expr):
return


class FusedIO(BlockwiseIO):
    _parameters = ["expr"]
Member:

I suppose we will still generate the graph for self.operand("expr") when FusedIO.__dask_graph__ is called.

Collaborator Author:

Yes, in _task.
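For readers following along, a minimal sketch of what such a fused task might do, assuming it simply materializes and concatenates several partitions of the wrapped expression (hypothetical names, not the PR's exact code):

import pandas as pd

def fused_task(read_partition, partition_indices):
    # One fused task: read several upstream partitions and combine
    # them into a single, larger partition.
    parts = [read_partition(i) for i in partition_indices]
    return pd.concat(parts)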

@phofl (Collaborator, Author) commented Oct 11, 2023

@mrocklin and I chatted offline about this. The consensus was that we would like to remove the dependency on divisions in the future. This will probably open up ways to integrate this solution better with BlockwiseIO or similar, but that is hard to do until we get there.

I'll merge this solution now; happy to iterate on what we have here as long as we keep the performance improvement. I might change the way this is applied a little tomorrow or on Friday, depending on how some of my explorations go. The current plan is to continue iterating to improve the performance of the system through new optimizations/new wrinkles that we can introduce.

@phofl merged commit 044617b into dask:main on Oct 11, 2023 (4 checks passed).
@phofl deleted the fuse_io branch on October 11, 2023 at 22:30.
@rjzamora (Member)

> The consensus was that we would like to remove the dependency on divisions in the future.

I'm on board with this (as you know). I'm interested to know whether you are looking to remove the dependency across the board in a general way, or just want to avoid division/npartition queries unless absolutely necessary.
