add-hostfile not working on parallel prun commands #1773
Comments
Looks like the remote host cannot open a TCP socket back to
Are you running with both hosts in your hostfile, or just the one?
I am running with just one host in the hostfile. I meant to say that there have been no problems running multiple prun commands without the add-hostfile option.
I understood that last part. My point was just that if there is only one host in the hostfile, then you won't detect that your remote host cannot communicate back to you. Try putting both hosts in that hostfile and see if you can start the DVM.
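The connectivity question raised above can also be checked outside PRRTE. The following is a minimal sketch, not part of PRRTE: `can_connect` is a hypothetical helper, and the host and port to probe are assumptions you would supply yourself (the thread does not say which address or port the daemon tries to reach). Run from the added node against the head node, it only reports whether a plain TCP connection can be opened.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for illustration; host and port are supplied by the caller.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A `False` result points at routing, firewall, or name-resolution problems between the nodes rather than at PRRTE itself; a `True` result does not rule out a PRRTE-level issue.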
I wanted to use the DVM expansion feature, starting DVM with one node and then add another node with |
I understand what you want to do - I'm just trying to check that PRRTE itself is behaving correctly. With both hosts in the hostfile, it starts - which means that the daemon can indeed communicate back. So the question becomes: why can't it do so when started by add-hostfile? You were able to do it before, so what has changed? If you run the two |
Yes. Running sequentially works.
I'm not familiar with Python's Let's see if the problem really is in PRRTE and not in how you are trying to do this. Add |
Oh yeah - also add |
Yes. The project I am working on uses the fork start method of Python's multiprocessing module to start the processes. The following is the log after it receives the prun commands.
|
Understood - but as implemented, your test will yield non-deterministic results. Is that what you want? Try adding |
This is the log with
|
Just for clarification - is this that Slurm environment again? If so, that could well be the problem.
I am running under the Slurm allocation but with prrte and pmix built with |
Sounds suspicious - try running |
This is the result for
|
Sigh - I want to see the output when it immediately starts up, please. |
Question: is |
I am sorry. Here is the full output.
|
I found another strange issue while running
|
Okay, that confirms the setup - no Slurm interactions. I'm afraid this will take a while to track down. It's some kind of race condition, though the precise nature of it remains hard to see. Unfortunately, I'm pretty occupied right now, which will further delay things. I'd suggest you run those
No ideas - I can try to reproduce, but don't know if/when I'll be able to do so.
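Since running the prun commands sequentially was reported to work earlier in the thread, one possible stopgap is to run any command that extends the DVM on its own first and only then start the remaining commands in parallel. The sketch below illustrates that ordering; it is not a fix for the underlying race and has not been verified against PRRTE. `extends_dvm`, `run_cmd`, and `run_expansions_first` are hypothetical names, and a thread pool stands in for whatever process-based launcher the project uses.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def extends_dvm(cmd):
    """True if the command line asks prun to add hosts to the DVM."""
    return "--add-hostfile" in cmd


def run_cmd(cmd):
    """Run one command and return its exit code."""
    return subprocess.run(cmd).returncode


def run_expansions_first(cmds, runner=run_cmd):
    """Run DVM-extending commands one at a time, then the rest in parallel.

    Returns a dict mapping each command (as a tuple) to its exit code.
    """
    expanding = [c for c in cmds if extends_dvm(c)]
    regular = [c for c in cmds if not extends_dvm(c)]
    results = {}
    # Phase 1: serialize every command that changes the DVM's node set.
    for cmd in expanding:
        results[tuple(cmd)] = runner(cmd)
    # Phase 2: the remaining commands may overlap freely.
    if regular:
        with ThreadPoolExecutor(max_workers=len(regular)) as pool:
            for cmd, rc in zip(regular, pool.map(runner, regular)):
                results[tuple(cmd)] = rc
    return results
```

The `runner` parameter exists only so the ordering can be exercised without a running DVM; in real use the default would invoke prun directly.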
Background information
I am working on a project, one part of which runs multiple prun commands in parallel from multiple processes to launch multiple tasks. Some of these commands use the --add-hostfile option to extend the existing DVM.
What version of the PMIx Reference RTE (PRRTE) are you using? (e.g., v2.0, v3.0, git master @ hash, etc.)
55536ef
What version of PMIx are you using? (e.g., v4.2.0, git branch name and hash, etc.)
openpmix/openpmix@bde8038
Please describe the system on which you are running
Details of the problem
Steps to reproduce
1. Create hostfile with one node and add_hostfile with another node.
2. Use hostfile to start the DVM:
prte --report-uri dvm.uri --hostfile hostfile
3. Run prun commands in parallel, one with the add-hostfile option and another without, as the following.
It outputs the following error.
If no add-hostfile option is given, both processes run without error.
While debugging, it can be seen that the daemon was launched on the added node but throws the following error during initialization and terminates the PMIx server.
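The reproduction steps above can be sketched in Python roughly as follows. The original prun command lines are not preserved in this report, so everything beyond `prte --report-uri dvm.uri --hostfile hostfile` and the `--add-hostfile` option is an assumption: the helper names are hypothetical, `--dvm-uri file:dvm.uri` is assumed to be how prun is pointed at the running DVM, and the application to launch is a placeholder chosen by the caller.

```python
import multiprocessing as mp
import subprocess

# Assumption: the DVM was started with
#   prte --report-uri dvm.uri --hostfile hostfile
DVM_URI_FILE = "dvm.uri"


def build_prun_cmd(app, add_hostfile=None, dvm_uri_file=DVM_URI_FILE):
    """Build a prun command line; `app` is the program and its arguments.

    Assumption: prun locates the DVM via --dvm-uri file:<name>.
    """
    cmd = ["prun", "--dvm-uri", f"file:{dvm_uri_file}"]
    if add_hostfile is not None:
        # Ask prun to extend the running DVM with the hosts in this file.
        cmd += ["--add-hostfile", add_hostfile]
    return cmd + list(app)


def run_prun(cmd):
    """Run one prun command and return its exit code."""
    return subprocess.run(cmd, capture_output=True, text=True).returncode


def launch_parallel(cmds):
    """Start every command at roughly the same time from forked workers."""
    ctx = mp.get_context("fork")  # fork start method, as described in the thread
    with ctx.Pool(processes=len(cmds)) as pool:
        return pool.map(run_prun, cmds)
```

Calling `launch_parallel` with one command built with `add_hostfile="add_hostfile"` and one built without it would mirror step 3; with a placeholder application such as `["hostname"]`, the returned list holds one exit code per prun invocation.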
@hppritcha @rhc54 Please advise.