Packing all part outputs/inputs into a single buffer #305
I am a bit surprised to hear that the sheer number of comm buffers is driving significant growth in the argument count. With chemistry, I'd expect ~10 buffers per neighbor, which isn't nothing, but it's not huge compared to the limit (500-something?).

I'm also reminded that buffer merging may mitigate the issue, but it doesn't address it head-on: there can still be unhappy DAGs that overflow the argument count. Maybe we should try to prevent the issue outright, either by limiting fusion or by partitioning into appropriately-sized pieces a priori. What do you think?

As for the actual idea: I can see some tricky bits in realizing it, but in general I think the idea is well-defined and feasible. The tricky bits I see are:
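A minimal sketch of the "partition into appropriately-sized pieces a priori" idea, assuming a topologically ordered node list and a hypothetical `buffers_of` callback (neither exists in the codebase; this is only to make the greedy strategy concrete):

```python
def partition_by_arg_budget(nodes, buffers_of, max_args):
    """Greedily split a topologically ordered node list into pieces
    whose number of distinct buffers stays within max_args.

    buffers_of(node) -> set of buffer names the node touches.
    """
    pieces, current, seen = [], [], set()
    for node in nodes:
        needed = buffers_of(node)
        # Cut a new piece if adding this node would exceed the budget.
        if current and len(seen | needed) > max_args:
            pieces.append(current)
            current, seen = [], set()
        current.append(node)
        seen |= needed
    if current:
        pieces.append(current)
    return pieces

# Toy example: each node reads one buffer and writes one output buffer.
nodes = ["a", "b", "c", "d", "e"]
buffers_of = lambda n: {n, n + "_out"}
assert partition_by_arg_budget(nodes, buffers_of, max_args=4) == [
    ["a", "b"], ["c", "d"], ["e"]]
```

A real version would need to account for the inter-piece temporaries that each cut materializes, which themselves become kernel arguments.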
Thanks for taking a look, and for the in-depth suggestion. I agree that this is not quite the correct way to go. Even on single-rank multispecies operators, we obtain all-face flux-computing kernels with ~700 arguments (see https://gist.github.com/kaushikcfd/d10ea12c682bc897437a54d28a41da4d).
It also depends on which expressions are materialized. Post inducer/meshmode#312, it turns out the number of distinct common subexpressions went up.
Although this is a good option, I get the feeling that cooking up a heuristic for this won't be easy (it would have to combine CSE with a priori estimation of the kernel argument count). The closest solution that I think will reliably work is inducer/loopy#599.
I'm wary of the complexity this inflicts on memory allocation. One sub-buffer will keep lots of others alive (to be fair, only for the duration of the operator, but our intra-operator memory consumption is already far higher than it needs to be IMO), cf. all the out-of-memory issues that are limiting scale. And it's not like we can make our memory allocation scheme aware of these shenanigans: doing so would require passing an offset to the kernel, negating the benefits.
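The lifetime concern above can be illustrated with numpy views (the array sizes here are arbitrary; the same aliasing logic applies to any sub-buffer scheme):

```python
import numpy as np

# One packed allocation standing in for many merged comm buffers
# (1,000,000 float64 entries = 8 MB).
packed = np.empty(1_000_000, dtype=np.float64)

# A tiny sub-buffer carved out of it is a *view*: it holds a reference
# to the whole allocation, so even after every other sub-buffer is
# logically dead, the full 8 MB cannot be released while this view lives.
tiny = packed[:16]
assert tiny.base is packed      # the view pins the full buffer
del packed                      # drops a reference, frees nothing
assert tiny.base.nbytes == 8_000_000
```

This is the sense in which one sub-buffer keeps lots of others alive: the allocator sees a single live allocation, not the individual pieces.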
How is that possible? Do you have a sense of why that's happening? It doesn't make intuitive sense to me, and the code you linked... isn't helping :).
I agree, it's a thorny issue. I'd be inclined to do something once we're in the loopy representation: shove in some additional global barriers if we're at risk of overrunning the argument-count limit. Another thing I'm asking myself: human-written codes get by with far fewer than 500-odd temporaries. How can we approximate their behavior better?
Since the large number of kernel arguments has been a problem (cf. illinois-ceesd/mirgecom#599), I propose a transformation that packs all sends/recvs of parts into a single buffer and replaces their uses with an offset index expression into the packed buffer. Does anyone see a way this could go wrong?
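A minimal sketch of the proposed packing, using plain numpy for illustration (the real transformation would operate on the array-DAG representation, and `pack_buffers` is a hypothetical name):

```python
import numpy as np

def pack_buffers(bufs):
    """Pack several buffers into one flat allocation.

    Returns the packed buffer and, for each input, its (offset, size),
    so that each use can be rewritten as packed[offset:offset+size].
    """
    sizes = [b.size for b in bufs]
    offsets = np.concatenate(([0], np.cumsum(sizes[:-1]))).astype(int)
    packed = np.empty(sum(sizes), dtype=bufs[0].dtype)
    for b, off, n in zip(bufs, offsets, sizes):
        packed[off:off + n] = b.ravel()
    return packed, list(zip(offsets, sizes))

# Example: three per-rank send buffers collapse into one kernel argument.
sends = [np.full(4, rank, dtype=np.float64) for rank in range(3)]
packed, slots = pack_buffers(sends)
off, n = slots[1]
assert (packed[off:off + n] == sends[1]).all()
```

Note that the offsets are compile-time constants here, so they can be folded into the index expressions rather than passed as extra kernel arguments.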