Too many open files #593
Yeah, that is a pretty oppressive ulimit. If we can't raise it ourselves, I think asking Kilian to raise it is really the only way to go here.
It looks like there are ~300 open files per sim, which stays constant through each time step, and we can run up to 12 sims on a single Sherlock compute node with only rare failures. This suggests Sherlock recently reduced the limit in response to the filesystem issues they have been experiencing, and we are only now running into it. It's probably not a bug in our code but a consequence of the compute environment, though it will be something to consider when running large sets of sims on Sherlock. I'll close this since it seems to be outside our repo.
For completeness, after a conversation with Kilian it appears this problem should be resolved. Compute nodes were supposed to have different ulimits:
In certain circumstances (ssh'ing directly to compute nodes and, most likely, launching fireworks with qlaunch, although the latter wasn't confirmed before the config change), the ulimits were the same as on the login node, as listed above. qlaunch now sees the higher limits, and I expect the errors will not recur once we start running large sets of sims.
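The limits qlaunch inherits can be checked from inside Python without shelling out to `ulimit`. A minimal sketch using the standard `resource` module (Unix-only; the print format is my own, not from the repo):

```python
import resource

# Query this process's open-file limits; equivalent to `ulimit -Sn`
# (soft) and `ulimit -Hn` (hard) in the shell.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft} hard={hard}")
```

Logging these two numbers at the start of each firework would confirm whether a job actually inherited the higher compute-node limits.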
Reopening because this issue came up again when running sims for the paper. One seed failed with too many open files when running the following workflow with qlaunch:
If we keep running into this issue, it will take a long time to run the sims on Sherlock, so we should probably troubleshoot or consider alternatives. A quick check on some running jobs on Sherlock shows a very low number of open files, so it seems odd that we'd hit the limit of 131072 unless the limit depends on other users, the limit is not actually that high, or we have a spike of open files at some point. Node with one sim:
Node with one multigen analysis:
Node with one single analysis:
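The per-process counts in the checks above can be reproduced by listing `/proc/<pid>/fd`, where each entry is one open descriptor. A sketch, assuming Linux (which Sherlock runs); `open_fd_count` is a hypothetical helper, not something in the repo:

```python
import os

def open_fd_count(pid="self"):
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    # Linux-only; works for your own processes without sudo.
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Running this with a sim's PID (e.g. `open_fd_count(12345)`) at each time step would show whether descriptors accumulate or spike.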
About 10% of the sims failed overnight because of too many open files. I've placed some print statements (the # comments were added afterwards to show the command) before and after the firework runs, as well as right before the exception is raised in
From Kilian:
Looks like it was a Sherlock issue, so hopefully this will go away soon.
This time the fix actually worked. No failed runs overnight due to the filesystem.
I was trying to run a bunch of sims from `release-paper` on Sherlock (4 gens x 256 seeds) and ran into different `IOError` and `OSError` messages about too many open files. Perhaps similar to the Sisyphus issue (although this occurred in sims, not the causality network), it might indicate that we aren't properly closing files. Out of the 256 seeds, 202 failed at some point trying to write files or make directories using `makedirs`. This could be because multiple sims were running on the same nodes and the problem is specific to Sherlock, but it's worth investigating. The problem is we don't have sudo permissions on Sherlock, so we can't change file limits. I think this would be a good test case for the new gcloud workflow, and I can follow up with Jerry and Ryan about getting the `release-paper` branch up and running on gcloud.

For reference, the current hard and soft limits on Sherlock are:
Make directories: `OSError` (example stack trace; happened with different tables, 177 failures):

Write file: `IOError` (example stack trace; happened with different tables):