-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New distributed-memory DAG partitioner #393
Conversation
Can you demonstrate with an example where the global communication DAG is sound (i.e. has no dependency cycles) but our generated partition is not? I understand there is #378, but that seems like a minor implementation bug. |
I mean... the burden is really the other way around. Algorithms are suspect until proven correct, not the other way around. 🙂 I did not see a way to prove the existing one correct.
|
That's completely fair. But, before we propose an improvement we need an MWE for which the old implementation leads to undesired partitions. The previous rewrite effort did provide MWEs to justify itself via inducer/grudge#237, #272.
Getting an MWE for that should be fairly easy. FWIW, it's not clear to me that it should result in cyclic partition. I just don't want us to spend effort looking for a solution for an unknown problem. |
I see your point, and in the absence of this cycle thing being a show-stopper for much of CEESD, this would not have happened! Potentially @matthiasdiener (via #391) will supply evidence that the mirgecom-generated graph is OK. In the meantime, I'll also think about a counterexample for the existing verifier. But, perhaps understandably, my first priority is to get CEESD un-stuck. |
I think I've got an example that shows why information on just the local rank won't suffice. Consider this distributed DAG: The point is that the "left-going" and the "right-going" communication will look identical to rank 1. I do not think it has a way of knowing that those two can't be batched together without creating a cycle, or that the left-going must come before the right-going. (edit: left and right is hard) (This was an example that @majosm and I were also discussing earlier today.) |
Yep, this makes sense. The current implementation will expect receives for rank-1 in a single part, and this should lead to a deadlock. Thanks a lot, was completely blindsided by this scenario. |
Here's a reproducer for the hypothesized comm pattern: https://gist.github.com/kaushikcfd/efc4661814f46924b75194dfec177246. It does emit a |
c993f0b
to
0cc2f46
Compare
pytato/distributed/partition.py
Outdated
materialized_ary_to_part_id: Dict[Array, int] = { | ||
ary: materialized_ary_to_part_id_end[ary] - 1 | ||
for ary in materialized_arrays} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@majosm IIUC, this represents a policy of computing things "as late as possible", whereas I'd personally maybe go with computing things "as early as possible". What made you go with late over early? (We should probably add a comment that this is a choice either way.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't have any particular reason for choosing that way over the other, just had to pick one. Maybe one way to retroactively justify that choice is that it would keep parts from holding up the show by doing unrelated work before firing off their sends? 🤷♂️
fda3964
to
94b7d19
Compare
537486f
to
7e452ae
Compare
inducer/pytato#393 changes the function signature.
inducer/pytato#393 changes the function signature.
7e452ae
to
300edae
Compare
effd604
to
692e681
Compare
c9f0320
to
c8db611
Compare
The deterministic partitioning test is back, but unfortunately it fails at the moment. See c8db611 and inducer/pymbolic#125 for my current, somewhat radical approach to this. |
c8db611
to
f8dacd3
Compare
Co-authored-by: Matt Smith <[email protected]>
f8dacd3
to
7b38c26
Compare
Superseded by #434. |
inducer/pytato#393 changes the function signature.
* Support new find_distributed_partition inducer/pytato#393 changes the function signature. * pylint * flake8 * only catch one type of TypeError Co-authored-by: Matt Smith <[email protected]> * flake8 --------- Co-authored-by: Matt Smith <[email protected]>
This has the beginnings of a new algorithm for the DAG partitioner that provably (I think!) will not artificially introduce cycles into the partitioned DAG.
Draft because:
find_distributed_partition
without communication. As a result, the interface tofind_distributed_partition
has to change in a backwards-incompatible way, so that we can get an MPI communicator. So this needs to merge at the same time as corresponding downstream PRs.find_distributed_partition
is now also MPI-collective.Maybe split off the initial, communication-needing part offind_distributed_partition
from the remainder of the routine.test_deterministic_partitioning
: it no longer makes sense to run it in the absence of MPI.Add a test based on DistributedRecv reachable from multiple parts results in multiple receives for the same tag/rank pair #378.examples/distributed.py
with three ranks caused cycles at a point. Add tests based on it.Notes:
find_distributed_partition
. That may be nicer to look at rendered in sphinx.find_distributed_partitions
#277.cc @majosm @matthiasdiener @kaushikcfd
Closes: #378.