
Packing all part outputs/inputs into a single buffer #305

Open
kaushikcfd opened this issue Apr 9, 2022 · 3 comments
@kaushikcfd
Collaborator

Since the large number of kernel arguments has been a problem (cf. illinois-ceesd/mirgecom#599), I propose a transformation that packs all sends/recvs of parts into a single buffer, replacing their uses with offset index expressions into the packed buffer. Does anyone see a way this could go wrong?
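To make the idea concrete, here is a minimal sketch of the packing scheme in plain numpy (not pytato's actual API); `pack_buffers` and all names are illustrative, and a real pass would rewrite DAG accesses rather than copy data eagerly:

```python
import numpy as np

def pack_buffers(buffers):
    """Pack several 1D arrays into one contiguous buffer.

    Returns the packed buffer and, for each input, the offset at
    which it starts, so a use of buffers[k][i] can be rewritten
    as packed[offsets[k] + i].
    """
    offsets = []
    pos = 0
    for b in buffers:
        offsets.append(pos)
        pos += b.size
    packed = np.empty(pos, dtype=buffers[0].dtype)
    for b, off in zip(buffers, offsets):
        packed[off:off + b.size] = b
    return packed, offsets

# usage: three per-part send buffers become a single kernel argument
parts = [np.arange(3.), np.arange(4.), np.arange(2.)]
packed, offs = pack_buffers(parts)
# an access parts[1][i] becomes packed[offs[1] + i]
```

The point being that the kernel then sees one base pointer plus compile-time offsets instead of one argument per part.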

@inducer
Owner

inducer commented Apr 9, 2022

I am a bit surprised to hear that the sheer number of comm buffers is driving significant growth in the argument count. With chemistry, I'd expect ~10 buffers per neighbor, which isn't nothing, but it's not huge compared to the limit (500-something?).

I also feel reminded that buffer merging may mitigate the issue, but it doesn't address it head-on. So there can still be unhappy DAGs that overflow the arg count. Maybe we should try to think about preventing this issue outright, either by limiting fusion, or by partitioning into appropriately-sized pieces a priori. What do you think?

As for the actual idea: I can see some tricky bits in realizing it, but in general I think the idea is well-defined and feasible. The tricky bits I see are:

  • The two sides of the communication need to agree on what's in the buffer, likely via the existing tags. Maybe all we need is another tag of PackedBuffer((tag1, tag2, tag3))? It'd be easier if we had a way to unambiguously sort tags, but I'm not sure that's something we've required so far.
  • Can this be done as a pass that works only on the pytato DAG? I think possibly yes, though buffer order might be an issue. I'd be less fond of it if this is something that we'd have to bake into codegen.
  • This forces a bit more synchronization than is implied in the DAG, but we can always back that off if/when it becomes a problem.
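On the first bullet: a sketch of what a `PackedBuffer` tag might look like, assuming (this is only an assumption, not settled design) that canonicalizing the constituent tags at construction time is how both sides agree on the layout. Sorting by `repr()` is a placeholder for a real total order on tags:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PackedBuffer:
    # constituent communication tags, in canonical order
    tags: tuple

    @classmethod
    def from_tags(cls, tags):
        # canonicalize so sender and receiver derive the same
        # buffer layout regardless of the order tags were gathered
        return cls(tuple(sorted(tags, key=repr)))

# both ranks gather the same tags in different orders,
# yet end up with equal PackedBuffer tags
send_side = PackedBuffer.from_tags(["flux", "temperature", "density"])
recv_side = PackedBuffer.from_tags(["density", "flux", "temperature"])
assert send_side == recv_side
```

If `repr()` isn't stable enough across ranks, the unambiguous-sort requirement mentioned above would have to be imposed on the tag types themselves.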

@kaushikcfd
Collaborator Author

Thanks for taking a look and for the in-depth suggestions.

I agree that this alone is not quite the correct way to go. Even on single-rank multispecies operators, we obtain all-face flux-computing kernels with ~700 arguments (see https://gist.github.com/kaushikcfd/d10ea12c682bc897437a54d28a41da4d).

> With chemistry, I'd expect ~10 buffers per neighbor, which isn't nothing, but it's not huge compared to the limit (500-something?).

It also depends on what expressions are materialized. Post inducer/meshmode#312, it turns out the number of distinct common sub-expressions went up.

> Maybe we should try to think about preventing this issue outright, either by limiting fusion, or by partitioning into appropriately-sized pieces a priori.

Although this is a good option, I get the feeling that cooking up a heuristic for this won't be easy (it would have to combine CSE with a priori estimation of the kernel argument count). The closest solution that I think will reliably work is inducer/loopy#599.

@inducer
Owner

inducer commented Apr 10, 2022

> The closest solution that I think will reliably work is inducer/loopy#599.

I'm wary of the complexity this inflicts on memory allocation. One sub-buffer will keep lots of others alive (to be fair, only for the duration of the operator, but our intra-operator memory consumption is already far higher than it needs to be IMO), cf. all the out-of-memory issues that are limiting scale. And it's not like we can make our memory allocation scheme aware of these shenanigans: doing so would require passing an offset to the kernel, negating the benefits.
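The lifetime concern can be illustrated with plain numpy views (the packed-buffer machinery would behave analogously): a small view into a large allocation pins the entire allocation for as long as the view lives.

```python
import numpy as np

big = np.zeros(1_000_000)   # one large packed buffer
small = big[:10]            # sub-buffer handed to later code
del big                     # the 1M-element allocation survives,
                            # pinned by the 10-element view
assert small.base is not None
assert small.base.size == 1_000_000
```

So even after most sub-buffers are logically dead, the allocator cannot reclaim any of the packed storage.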

> it turns out the number of distinct common sub-expressions went up.

How is that possible? Do you have a sense of why that's happening? It doesn't make intuitive sense to me, and the code you linked... isn't helping :).

> I get a feeling that cooking up a heuristic for this won't be easy

I agree, it's a thorny issue. I'd be inclined to do something once we're in loopy representation: shove in some additional global barriers if we're at risk of overrunning the arg count limit.
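One hypothetical shape such a loopy-level pass could take (names and the greedy policy are entirely illustrative, not an existing loopy feature): walk the statement stream, track the live argument set, and start a new segment, separated by a global barrier or kernel boundary, whenever the next statement would push the count over the limit.

```python
def split_at_arg_limit(stmt_args, limit=400):
    """Greedily split a statement stream into segments whose
    combined argument set stays under `limit`.

    stmt_args: list of sets of argument names, one per statement.
    Returns a list of segments; a global barrier (or separate
    kernel launch) would be inserted between segments.
    """
    segments, current, live = [], [], set()
    for args in stmt_args:
        if current and len(live | args) > limit:
            segments.append(current)
            current, live = [], set()
        current.append(args)
        live |= args
    if current:
        segments.append(current)
    return segments

# usage: with a limit of 4 distinct arguments, four statements
# split into two segments
stmts = [{"a", "b"}, {"b", "c"}, {"d", "e"}, {"e", "f"}]
segments = split_at_arg_limit(stmts, limit=4)
```

A real pass would also have to account for temporaries that cross the split point becoming global, which itself adds arguments, so the greedy policy here is an oversimplification.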

Another thing I'm asking myself is: Human-written codes get by with far fewer than 500-odd temporaries. How can we approximate their behavior better?
