Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default SSH for omni_util #40

Closed
jordap opened this issue Jun 21, 2024 · 6 comments · Fixed by #49
Closed

Default SSH for omni_util #40

jordap opened this issue Jun 21, 2024 · 6 comments · Fixed by #49
Assignees
Labels
bug Something isn't working
Milestone

Comments

@jordap
Copy link
Collaborator

jordap commented Jun 21, 2024

We currently have a --use_pdsh flag that is enabled by default, and cannot be disabled from the command line:
https://github.com/AMDResearch/omniwatch/blob/38c7ed42ad3669385fe2857de767428645501f7d/omniwatch/omni_util.py#L305

Options:

  1. If standard ssh calls are supposed to be the default, the --use_pdsh flag should default to False.
  2. If PDSH is supposed to be the default, we could change the flag to --disable-pdsh (default: False, action: store_true).
  3. If we want to keep a non-negative flag allowing it to be disabled, we can use the following approach:
    parser.add_argument('--pdsh', dest='pdsh', action='store_true')
    parser.add_argument('--no-pdsh', dest='pdsh', action='store_false')
    parser.set_defaults(pdsh=True)
    
@jordap jordap added the bug Something isn't working label Jun 21, 2024
@jordap jordap added this to the 1.0 milestone Jun 21, 2024
@koomie
Copy link
Collaborator

koomie commented Jun 21, 2024

I made it the default as it is the only viable option on large systems like Frontier. We can deprecate the older serialized ssh variant and just make this pdsh approach the only launch path in user-mode. At that point we can remove the command-line argument alltogether.

@koomie koomie self-assigned this Jun 21, 2024
@jordap
Copy link
Collaborator Author

jordap commented Jun 21, 2024

Alright, that probably makes sense and will make things simpler. I was a little hesitant to remove the serialized SSH because PDSH isn't working properly in all the systems I've been using for testing, but the SSH version works just fine. (This will force me to look into it and figure out what's happening with PDSH.)

@koomie
Copy link
Collaborator

koomie commented Jun 21, 2024

Now that you say that, I believe I did encounter an issue with pdsh variant on hpcfund. Assuming we can resolve that, I think we only need a single approach.

@jordap
Copy link
Collaborator Author

jordap commented Jun 28, 2024

I haven't investigated this issue, but found a quick workaround that fixes it in the cluster where I'm testing.

Instead of the using the standard import: https://github.com/AMDResearch/omniwatch/blob/38c7ed42ad3669385fe2857de767428645501f7d/omniwatch/omni_util.py#L40

Use the following import to force libssh (as opposed to libssh2):

from pssh.clients.ssh.parallel import ParallelSSHClient

Sharing this workaround for reference. It may be useful to debug the issue later.

@jordap
Copy link
Collaborator Author

jordap commented Jun 28, 2024

Additional information: ParallelSSH/parallel-ssh#363

Seems to be an issue with certain keys and newer SSH versions. Frontier is running an older version of SSH (8.4p1) compared to other systems we use (8.7p1 - 8.9p1).

Generating a new RSA key doesn't help. Other types of keys work fine in my environment, but RSA is widely used and I prefer not to ask users to generate a new type of key only for Omnistat.

I haven't tested anything else beyond that. We'd need to test performance at scale, but using ssh-python with the alternative import seems to work and it may be the easiest solution.

@jordap
Copy link
Collaborator Author

jordap commented Jul 3, 2024

I tested this in Frontier, and it seems to work fine with libssh.

@jordap jordap linked a pull request Jul 3, 2024 that will close this issue
@koomie koomie closed this as completed in #49 Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants