Too many open files #593

Closed
tahorst opened this issue Jul 11, 2019 · 7 comments

tahorst commented Jul 11, 2019

I was trying to run a bunch of sims from release-paper on Sherlock (4 gens x 256 seeds) and ran into different IOError and OSError messages about too many open files. This is perhaps similar to the Sisyphus issue (although it occurred in sims, not the causality network) and might indicate that we aren't properly closing files. Out of the 256 seeds, 202 failed at some point while trying to write files or make directories with makedirs. This could be because multiple sims were running on the same nodes and the problem is specific to Sherlock, but it's worth investigating. The problem is we don't have sudo permissions on Sherlock, so we can't raise the file limits. I think this would be a good test case for the new gcloud workflow, and I can follow up with Jerry and Ryan about getting the release-paper branch up and running on gcloud.

For reference, the current hard and soft limits on Sherlock are:

$ ulimit -Hn
4096

$ ulimit -Sn
1024
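
Side note: a process can raise its own soft limit up to the hard limit without sudo, so 4096 is the effective ceiling here. A minimal sketch using Python's resource module (purely illustrative, not something the sims currently do):

import resource

# Equivalent to `ulimit -Sn` / `ulimit -Hn` for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit: %d, hard limit: %d" % (soft, hard))

# The hard limit can't be raised without root privileges, but the soft
# limit can be bumped up to the hard limit from inside the process.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))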

Make directories OSError (example stack trace; this happened with different tables, with 177 failures in total):

Traceback (most recent call last):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket.py", line 262, in run
    m_action = t.run_task(my_spec)
  File "/home/users/thorst/wcEcoli/wholecell/fireworks/firetasks/simulationDaughter.py", line 39, in run_task
    sim = EcoliDaughterSimulation(**options)
  File "/home/users/thorst/wcEcoli/wholecell/sim/simulation.py", line 117, in __init__
    self._initialize(sim_data)
  File "/home/users/thorst/wcEcoli/wholecell/sim/simulation.py", line 179, in _initialize
    logger.initialize(self)
  File "/home/users/thorst/wcEcoli/wholecell/loggers/disk.py", line 55, in initialize
    self.copyData(sim)
  File "/home/users/thorst/wcEcoli/wholecell/loggers/disk.py", line 97, in copyData
    obj.tableAppend(saveFile)
  File "/home/users/thorst/wcEcoli/models/ecoli/listeners/ribosome_data.py", line 121, in tableAppend
    numTrpATerminated = self.numTrpATerminated,
  File "/home/users/thorst/wcEcoli/wholecell/io/tablewriter.py", line 299, in append
    for name in namesAndValues.viewkeys()
  File "/home/users/thorst/wcEcoli/wholecell/io/tablewriter.py", line 299, in <dictcomp>
    for name in namesAndValues.viewkeys()
  File "/home/users/thorst/wcEcoli/wholecell/io/tablewriter.py", line 94, in __init__
    filepath.makedirs(path)
  File "/home/users/thorst/wcEcoli/wholecell/utils/filepath.py", line 37, in makedirs
    os.makedirs(full_path)
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 24] Too many open files: '/home/users/thorst/wcEcoli/out/20190710.183420__SET_D_4_gens_256_seeds,_unfit_ribosome_and_rna_poly_expression,_adjust_rnases/wildtype_000000/000159/generation_000003/000000/simOut/RibosomeData/columns/translationSupply'
Exception AttributeError: "'_Column' object has no attribute '_data'" in <bound method _Column.__del__ of <wholecell.io.tablewriter._Column object at 0x7f780ca54210>> ignored

Write file IOError (example stack trace; this happened with different tables):

Traceback (most recent call last):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket.py", line 262, in run
    m_action = t.run_task(my_spec)
  File "/home/users/thorst/wcEcoli/wholecell/fireworks/firetasks/simulationDaughter.py", line 39, in run_task
    sim = EcoliDaughterSimulation(**options)
  File "/home/users/thorst/wcEcoli/wholecell/sim/simulation.py", line 117, in __init__
    self._initialize(sim_data)
  File "/home/users/thorst/wcEcoli/wholecell/sim/simulation.py", line 179, in _initialize
    logger.initialize(self)
  File "/home/users/thorst/wcEcoli/wholecell/loggers/disk.py", line 55, in initialize
    self.copyData(sim)
  File "/home/users/thorst/wcEcoli/wholecell/loggers/disk.py", line 97, in copyData
    obj.tableAppend(saveFile)
  File "/home/users/thorst/wcEcoli/wholecell/listeners/evaluation_time.py", line 114, in tableAppend
    evolveState_total = self.evolveState_total,
  File "/home/users/thorst/wcEcoli/wholecell/io/tablewriter.py", line 299, in append
    for name in namesAndValues.viewkeys()
  File "/home/users/thorst/wcEcoli/wholecell/io/tablewriter.py", line 299, in <dictcomp>
    for name in namesAndValues.viewkeys()
  File "/home/users/thorst/wcEcoli/wholecell/io/tablewriter.py", line 97, in __init__
    self._offsets = open(os.path.join(path, FILE_OFFSETS), "w")
IOError: [Errno 24] Too many open files: u'/home/users/thorst/wcEcoli/out/20190710.183420__SET_D_4_gens_256_seeds,_unfit_ribosome_and_rna_poly_expression,_adjust_rnases/wildtype_000000/000112/generation_000003/000000/simOut/EvaluationTime/columns/calculateRequest_times/offsets'
Exception AttributeError: "'_Column' object has no attribute '_offsets'" in <bound method _Column.__del__ of <wholecell.io.tablewriter._Column object at 0x7f379460e210>> ignored
tahorst added the bug label Jul 11, 2019
prismofeverything (Member) commented:

Yeah, that is a pretty oppressive ulimit. If we can't raise it ourselves, I think asking Kilian to raise it is really the only way to go here.

tahorst commented Jul 12, 2019

It looks like there are ~300 open files per sim, which is constant through each time step, and we can run up to 12 sims on a single Sherlock compute node with only rare failures. This probably means Sherlock recently reduced the limit in response to the filesystem issues they have been experiencing, and we are only now running into it. So it's probably not a bug in our code but rather a property of the compute environment, although it will be something to consider when running large sets of sims on Sherlock. I'll close this since it seems to be outside our repo.
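
For anyone re-checking that ~300 number later: one quick way to count a process's open file descriptors on Linux without lsof is to list /proc/<pid>/fd. A rough sketch (hypothetical helper, not part of the repo):

import os

def count_open_fds(pid='self'):
    # Each open file descriptor of a process appears as an entry under
    # /proc/<pid>/fd on Linux (e.g. on Sherlock compute nodes).
    return len(os.listdir('/proc/{}/fd'.format(pid)))

print(count_open_fds())        # this process
print(count_open_fds(12345))   # another process by pid (hypothetical pid)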

tahorst closed this as completed Jul 12, 2019

tahorst commented Aug 7, 2019

For completeness, after a conversation with Kilian it appears this problem should be resolved. Compute nodes were supposed to have different ulimits:

$ ulimit -Hn
131072

$ ulimit -Sn
131072

In certain circumstances (ssh'ing directly to compute nodes and, most likely, launching fireworks with qlaunch, although that wasn't confirmed before the config change), the ulimits were the same as on the login node listed above. qlaunch now sees the higher limits, and I expect the errors will not occur once we start running large sets of sims.

tahorst commented Aug 20, 2019

Reopening because this issue came up again when running sims for the paper. One seed failed with too many open files when running the following workflow with qlaunch:

DESC="SET A 32 gens 8 seeds basal with growth noise and D period" \
VARIANT="wildtype" FIRST_VARIANT_INDEX=0 LAST_VARIANT_INDEX=0 \
SINGLE_DAUGHTERS=1 N_GENS=32 N_INIT_SIMS=8 \
MASS_DISTRIBUTION=1 GROWTH_RATE_NOISE=1 D_PERIOD_DIVISION=1 \
python runscripts/fw_queue.py
Traceback (most recent call last):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket.py", line 262, in run
    m_action = t.run_task(my_spec)
  File "/home/users/thorst/wcEcoli/wholecell/fireworks/firetasks/simulationDaughter.py", line 53, in run_task
    sim = EcoliDaughterSimulation(**options)
  File "/home/users/thorst/wcEcoli/wholecell/sim/simulation.py", line 119, in __init__
    self._initialize(sim_data)
  File "/home/users/thorst/wcEcoli/wholecell/sim/simulation.py", line 181, in _initialize
    logger.initialize(self)
  File "/home/users/thorst/wcEcoli/wholecell/loggers/disk.py", line 55, in initialize
    self.copyData(sim)
  File "/home/users/thorst/wcEcoli/wholecell/loggers/disk.py", line 97, in copyData
    obj.tableAppend(saveFile)
  File "/home/users/thorst/wcEcoli/models/ecoli/listeners/rna_synth_prob.py", line 64, in tableAppend
    nActualBound = self.nActualBound,
  File "/home/users/thorst/wcEcoli/wholecell/io/tablewriter.py", line 299, in append
    for name in namesAndValues.viewkeys()
  File "/home/users/thorst/wcEcoli/wholecell/io/tablewriter.py", line 299, in <dictcomp>
    for name in namesAndValues.viewkeys()
  File "/home/users/thorst/wcEcoli/wholecell/io/tablewriter.py", line 94, in __init__
    filepath.makedirs(path)
  File "/home/users/thorst/wcEcoli/wholecell/utils/filepath.py", line 37, in makedirs
    os.makedirs(full_path)
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 24] Too many open files: '/home/users/thorst/wcEcoli/out/20190819.181225__SET_A_32_gens_8_seeds_basal_with_growth_noise_and_D_period/wildtype_000000/000001/generation_000011/000000/simOut/RnaSynthProb/columns/pPromoterBound'
Exception AttributeError: "'_Column' object has no attribute '_data'" in <bound method _Column.__del__ of <wholecell.io.tablewriter._Column object at 0x7f7a285df990>> ignored

If we keep running into this issue, it will take a long time to run the sims on Sherlock, so we should probably troubleshoot or consider alternatives.

A quick check with some running jobs on Sherlock shows a very low number of open files, so it seems odd that we'd be hitting the limit of 131072 unless it depends on other users, the limit is not actually that high, or we have a spike of open files at some point (see the sketch after the counts below for one way to check that).

Node with one sim:

$ lsof -u thorst | wc -l
529

Node with one multigen analysis:

$ lsof -u thorst | wc -l
266

Node with one single analysis:

$ lsof -u thorst | wc -l
303
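
One way to test the spike hypothesis would be to poll a sim's descriptor count over its lifetime and record the peak. A rough sketch (hypothetical helper, assuming Linux /proc; not something the repo includes):

import os
import time

def watch_fd_peak(pid, interval=1.0):
    # Poll /proc/<pid>/fd roughly once per interval and return the
    # maximum number of simultaneously open descriptors observed.
    peak = 0
    while os.path.exists('/proc/%d' % pid):
        try:
            peak = max(peak, len(os.listdir('/proc/%d/fd' % pid)))
        except OSError:
            break  # the process exited between checks
        time.sleep(interval)
    return peak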

tahorst reopened this Aug 20, 2019

tahorst commented Aug 21, 2019

About 10% of the sims failed overnight because of too many open files. I've placed some print statements before and after the firework runs, as well as right before the exception is raised in wholecell.utils.filepath.makedirs, to see how many open files there are (the # comments were added afterwards to show the corresponding command). It seems we are far away from the limit unless Sherlock is placing some other restriction on it:

File limit - 131072  # ulimit -Hn
Open files - 27746  # lsof | wc -l
Open user files - 6505  # lsof -u thorst | wc -l

/home/users/thorst/wcEcoli:/home/users/thorst/wcEcoli:/home/users/thorst/wcEcoli:
2019-08-21 10:13:27,080 INFO Hostname/IP lookup (this will take a few seconds)
2019-08-21 10:13:27,081 INFO Launching Rocket
2019-08-21 10:14:36,533 INFO RUNNING fw_id: 1277 in directory: /home/users/thorst/wcEcoli/block_2019-08-21-13-23-41-155703/launcher_2019-08-21-17-11-07-215209
2019-08-21 10:14:36,696 INFO Task started: {{wholecell.fireworks.firetasks.simulation.SimulationTask}}.
Wed Aug 21 10:14:36 2019: Running simulation

Process files: 182  # lsof -p <pid>
User files: 6819  # lsof -u thorst
All files: 29395  # lsof

Time (s)  Dry mass     Dry mass      Protein          RNA    Small mol     Expected
              (fg)  fold change  fold change  fold change  fold change  fold change
========  ========  ===========  ===========  ===========  ===========  ===========
    0.00    127.67        1.000        1.000        1.000        1.000        1.000

Process files: 328  # lsof -p <pid>
User files: 6965  # lsof -u thorst
All files: 29980  # lsof

2019-08-21 10:15:08,758 INFO Rocket finished

Open files - 28696  # lsof | wc -l
Open user files - 6683  # lsof -u thorst | wc -l
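
For reference, counts like the ones above can be gathered with something along these lines (just a sketch of the idea behind the temporary print statements; the exact code isn't in the repo):

import os
import subprocess

def count_lines(cmd):
    # Run a shell pipeline like `lsof | wc -l` and return the count.
    return int(subprocess.check_output(cmd, shell=True))

print('Process files: %d' % count_lines('lsof -p %d | wc -l' % os.getpid()))
print('User files: %d' % count_lines('lsof -u $USER | wc -l'))
print('All files: %d' % count_lines('lsof | wc -l'))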

tahorst commented Aug 21, 2019

From Kilian:

We think we found where this may be coming from. We have a fix that we started rolling out, but it will likely take some time to make its way to all the nodes. So in the meantime, I'd recommend trying to work in $HOME, $GROUP_HOME or $OAK (basically anywhere other than $SCRATCH), those should not present the same limitation.

Looks like it was a Sherlock issue, so hopefully this will go away soon.

tahorst commented Aug 22, 2019

This time the fix actually worked. No failed runs overnight due to the filesystem.

tahorst closed this as completed Aug 22, 2019