
New distributed-memory DAG partitioner #393

Closed · wants to merge 7 commits
Conversation

@inducer (Owner) commented Nov 22, 2022

This has the beginnings of a new algorithm for the DAG partitioner that provably (I think!) will not artificially introduce cycles into the partitioned DAG.

Draft because:

  • It is not possible to implement a guaranteed cycle-free find_distributed_partition without communication. As a result, the interface of find_distributed_partition has to change in a backwards-incompatible way so that we can get an MPI communicator, and find_distributed_partition is now also MPI-collective (see the sketch after this list). This therefore needs to merge at the same time as the corresponding downstream PRs.
  • The implementation is unfinished and largely untested. It implements the four steps in the docstring, so that it knows the graph parts at the level of the communication delimiting them. Reinserting the computation into the batches still remains to be done.
  • Make sure the sends and receives are exactly what's given in the communication batches.
  • Maybe split off the initial, communication-needing part of find_distributed_partition from the remainder of the routine.
  • Add a test made from @kaushikcfd's MWE: https://gist.github.com/kaushikcfd/efc4661814f46924b75194dfec177246
  • Rework test_deterministic_partitioning: it no longer makes sense to run it in the absence of MPI.
  • Add a test based on #378 (DistributedRecv reachable from multiple parts results in multiple receives for the same tag/rank pair).
    • It's structurally not possible for sends and receives to get duplicated, so I think we don't need this.
  • @majosm observed that running examples/distributed.py with three ranks at one point caused cycles. Add tests based on that.
  • Catch when the breadth-first batch-maker fails to make progress, indicating a cycle.
  • Update the description of the algorithm in the docstring.
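
As a rough illustration of the interface change in the first item above: the sketch below is hypothetical, not taken from this PR's diff; the argument order of the new find_distributed_partition, the "ring" tag, and the toy ring exchange are all assumptions made for illustration.

    # Sketch of the call-site change: find_distributed_partition becomes
    # MPI-collective and needs a communicator.  The ring exchange is only a
    # stand-in DAG so that the call has something distributed to partition.
    import numpy as np
    import pytato as pt
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.rank, comm.size

    x = pt.make_placeholder("x", (4,), np.float64)
    recvd = pt.make_distributed_recv(
            src_rank=(rank - 1) % size, comm_tag="ring",
            shape=(4,), dtype=np.float64)
    out = pt.staple_distributed_send(
            x, dest_rank=(rank + 1) % size, comm_tag="ring", stapled_to=recvd)
    outputs = pt.make_dict_of_named_arrays({"out": out})

    # before: purely local, no communicator
    #     parts = pt.find_distributed_partition(outputs)
    # after: MPI-collective, so every rank must make this call
    parts = pt.find_distributed_partition(comm, outputs)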

Notes:

cc @majosm @matthiasdiener @kaushikcfd

Closes: #378.

@kaushikcfd (Collaborator)

It is not possible to implement a guaranteed cycle-free find_distributed_partition without communication. As a result, the interface to find_distributed_partition has to change in a backwards-incompatible way, so that we can get an MPI communicator. So this needs to merge at the same time as corresponding downstream PRs. find_distributed_partition is now also MPI-collective.

Can you demonstrate with an example where the global communication DAG is sound (i.e. has no dependency cycles) but our generated partition is not? I understand there is #378, but that seems like a minor implementation bug.

@inducer (Owner, Author) commented Nov 23, 2022

I mean... the burden is really the other way around. Algorithms are suspect until proven correct, not the other way around. 🙂 I did not see a way to prove the existing one correct.

@inducer closed this Nov 23, 2022
@inducer reopened this Nov 23, 2022
@inducer (Owner, Author) commented Nov 23, 2022

OK, did not mean to either close this or send this comment just yet. But anyway. Constructing a concrete counterexample (other than the potential one in mirgecom) is a task in itself; that's why tools like TLA+ exist. :)

I have a hunch that getting the "clocks" on two ranks misaligned would yield something, using a graph like the following:
[figure: sketch of the distributed DAG with misaligned per-rank "clocks"]

You need three or more ranks in order to be able to do that, which is somewhat consistent with when the problems in mirgecom started appearing.

@kaushikcfd (Collaborator)

Algorithms are suspect until proven correct, not the other way around. 🙂 I did not see a way to prove the existing one correct.

That's completely fair. But before we propose an improvement, we need an MWE for which the old implementation leads to undesired partitions. The previous rewrite effort provided MWEs to justify itself, via inducer/grudge#237 and #272.

I have a hunch that getting the "clocks" on two ranks misaligned would yield something, using a graph like the following:

Getting an MWE for that should be fairly easy. FWIW, it's not clear to me that it should result in a cyclic partition. I just don't want us to spend effort looking for a solution to an unknown problem.

@inducer (Owner, Author) commented Nov 23, 2022

Getting an MWE for that should be fairly easy. FWIW, it's not clear to me that it should result in a cyclic partition. I just don't want us to spend effort looking for a solution to an unknown problem.

I see your point, and had this cycle issue not been a show-stopper for much of CEESD, this would not have happened! Potentially @matthiasdiener (via #391) will supply evidence that the mirgecom-generated graph is OK. In the meantime, I'll also think about a counterexample for the existing verifier. But, perhaps understandably, my first priority is to get CEESD unstuck.

@inducer (Owner, Author) commented Nov 23, 2022

I think I've got an example that shows why information on just the local rank won't suffice. Consider this distributed DAG:

[figure: three-rank distributed DAG with "left-going" and "right-going" communication through rank 1]

The point is that the "left-going" and the "right-going" communication will look identical to rank 1. I do not think it has a way of knowing that those two can't be batched together without creating a cycle, or that the left-going must come before the right-going. (edit: left and right is hard)

(This was an example that @majosm and I were also discussing earlier today.)
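
For concreteness, here is a rough, hypothetical sketch of a communication pattern along these lines. It is not the figure above and not @kaushikcfd's gist; the tag names, shapes, and exact API usage are assumptions. A value travels rank 0 → rank 1 → rank 2 → rank 1 → rank 0, so rank 1's two receives look locally interchangeable even though the one from rank 2 can only be satisfied after rank 1 has issued its own send to rank 2.

    # Hypothetical 3-rank chain, SPMD-style: rank 0 -> 1 -> 2 -> 1 -> 0.
    # Run with exactly three ranks.  Batching both of rank 1's receives
    # into a single part would deadlock: rank 2 cannot send "ret21" before
    # it has received "fwd12", which rank 1 has not sent yet.
    import numpy as np
    import pytato as pt
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.rank

    x = pt.make_placeholder("x", (10,), np.float64)

    if rank == 0:
        # send x "rightward" to rank 1, eventually receive the result back
        returned = pt.make_distributed_recv(
                src_rank=1, comm_tag="ret10", shape=(10,), dtype=np.float64)
        out = pt.staple_distributed_send(
                x, dest_rank=1, comm_tag="fwd01", stapled_to=returned)
    elif rank == 1:
        # two receives that look alike from rank 1's local point of view...
        from_left = pt.make_distributed_recv(
                src_rank=0, comm_tag="fwd01", shape=(10,), dtype=np.float64)
        from_right = pt.make_distributed_recv(
                src_rank=2, comm_tag="ret21", shape=(10,), dtype=np.float64)
        # ...but the second depends, via rank 2, on this send:
        relayed = pt.staple_distributed_send(
                from_left + 1, dest_rank=2, comm_tag="fwd12",
                stapled_to=from_right)
        out = pt.staple_distributed_send(
                relayed + 1, dest_rank=0, comm_tag="ret10", stapled_to=x)
    else:  # rank 2
        received = pt.make_distributed_recv(
                src_rank=1, comm_tag="fwd12", shape=(10,), dtype=np.float64)
        out = pt.staple_distributed_send(
                received + 1, dest_rank=1, comm_tag="ret21", stapled_to=x)

    outputs = pt.make_dict_of_named_arrays({"out": out})

Nothing in rank 1's local DAG records that from_right transitively depends on the send carried by relayed; that ordering is only visible globally, which is the motivation for giving find_distributed_partition a communicator.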

@kaushikcfd (Collaborator)

The point is that the "left-going" and the "right-going" communication will look identical to rank 1. I do not think it has a way of knowing that those two can't be batched together without creating a cycle, or that the right-going must come before the left-going.

Yep, this makes sense. The current implementation will put all of rank 1's receives into a single part, and this should lead to a deadlock. Thanks a lot; I was completely blindsided by this scenario.

@kaushikcfd (Collaborator)

Here's a reproducer for the hypothesized comm pattern: https://gist.github.com/kaushikcfd/efc4661814f46924b75194dfec177246. It does emit a PartitionInducedCycleError.

Comment on lines 983 to 985:

    materialized_ary_to_part_id: Dict[Array, int] = {
        ary: materialized_ary_to_part_id_end[ary] - 1
        for ary in materialized_arrays}

@inducer (Owner, Author):

@majosm IIUC, this represents a policy of computing things "as late as possible", whereas I'd personally maybe go with computing things "as early as possible". What made you go with late over early? (We should probably add a comment that this is a choice either way.)

@majosm (Collaborator) replied:

I didn't have any particular reason for choosing that way over the other; I just had to pick one. Maybe one way to retroactively justify that choice is that it would keep parts from holding up the show by doing unrelated work before firing off their sends? 🤷‍♂️
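
For contrast, a purely hypothetical sketch of the "as early as possible" alternative discussed above. materialized_ary_to_part_id_start is an assumed counterpart to the materialized_ary_to_part_id_end mapping in the diff (the earliest part in which all of an array's dependencies are available); it does not exist in this PR.

    # hypothetical "as early as possible" policy: place each materialized
    # array in the first feasible part rather than the last
    # (materialized_ary_to_part_id_start is assumed, not in this PR)
    materialized_ary_to_part_id: Dict[Array, int] = {
        ary: materialized_ary_to_part_id_start[ary]
        for ary in materialized_arrays}

Per the discussion above, either policy yields a valid placement; the difference is whether unrelated work is scheduled before or after a part fires off its sends.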

pytato/distributed/partition.py: outdated review conversation (resolved)
@inducer force-pushed the dist-mem-part-2000 branch from fda3964 to 94b7d19 on December 6, 2022, 01:25
@inducer (Owner, Author) commented Jan 13, 2023

The deterministic partitioning test is back, but unfortunately it fails at the moment. See c8db611 and inducer/pymbolic#125 for my current, somewhat radical approach to this.

@inducer (Owner, Author) commented May 4, 2023

Superseded by #434.

@inducer closed this May 4, 2023
inducer pushed a commit to inducer/grudge that referenced this pull request on May 9, 2023:
* Support new find_distributed_partition

inducer/pytato#393 changes the
function signature.

* pylint

* flake8

* only catch one type of TypeError

Co-authored-by: Matt Smith <[email protected]>

* flake8

---------

Co-authored-by: Matt Smith <[email protected]>