BuildCausalityNetworkTask --> Too many open files #3

1fish2 · 2019-07-10T22:14:03Z

BuildCausalityNetworkTask writes 29100 json files into a sim generation's seriesOut/ dir, leading to a FileNotFoundException (Too many open files) in Sisyphus.

That's immediately followed by a `java.lang.ClassCastException: class java.lang.Character cannot be cast to class java.util.Map$Entry`, which might be a bug in the error-handling path.

Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]: java.io.FileNotFoundException: /tmp/sisyphus/outputs/data/jerry/20190710.140901/wildtype_000000/000000/generation_000000/000000/seriesOut/8699366258568945.json (Too many open files)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.io.FileInputStream.open0(Native Method)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.java.io$fn__11466.invokeStatic(io.clj:229)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.java.io$fn__11466.invoke(io.clj:229)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.java.io$fn__11379$G__11372__11386.invoke(io.clj:69)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.java.io$input_stream.invokeStatic(io.clj:136)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.java.io$input_stream.doInvoke(io.clj:121)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.lang.RestFn.invoke(RestFn.java:410)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.archive$pack_BANG_.invokeStatic(archive.clj:71)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.archive$pack_BANG_.invoke(archive.clj:58)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$push_output_BANG_.invokeStatic(task.clj:64)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$push_output_BANG_.invoke(task.clj:60)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$perform_task_BANG_.invokeStatic(task.clj:195)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$perform_task_BANG_.invoke(task.clj:139)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.core$sisyphus_handle_rabbit.invokeStatic(core.clj:109)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.core$sisyphus_handle_rabbit.invoke(core.clj:100)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$partial$fn__5824.invoke(core.clj:2626)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at langohr.consumers$create_default$fn__18445.invoke(consumers.clj:83)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at langohr.consumers.proxy$com.rabbitmq.client.DefaultConsumer$ff19274a.handleDelivery(Unknown Source)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:104)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.lang.Thread.run(Thread.java:834)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]: SEVERE sisyphus: rabbit-task
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]: java.lang.ClassCastException: class java.lang.Character cannot be cast to class java.util.Map$Entry (java.lang.Character and java.util.Map$Entry are in module java.base of loader 'bootstrap')
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.lang.APersistentMap.cons(APersistentMap.java:42)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.lang.RT.conj(RT.java:673)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$conj__5375.invokeStatic(core.clj:85)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$merge$fn__5943.invoke(core.clj:3049)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$reduce1.invokeStatic(core.clj:944)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$reduce1.invokeStatic(core.clj:934)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$merge.invokeStatic(core.clj:3048)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$merge.doInvoke(core.clj:3041)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.lang.RestFn.invoke(RestFn.java:421)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$send_BANG_.invokeStatic(task.clj:113)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$send_BANG_.invoke(task.clj:106)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$exception_BANG_.invokeStatic(task.clj:137)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$exception_BANG_.invoke(task.clj:134)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$perform_task_BANG_.invokeStatic(task.clj:209)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$perform_task_BANG_.invoke(task.clj:139)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.core$sisyphus_handle_rabbit.invokeStatic(core.clj:109)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.core$sisyphus_handle_rabbit.invoke(core.clj:100)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$partial$fn__5824.invoke(core.clj:2626)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at langohr.consumers$create_default$fn__18445.invoke(consumers.clj:83)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at langohr.consumers.proxy$com.rabbitmq.client.DefaultConsumer$ff19274a.handleDelivery(Unknown Source)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:104)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.lang.Thread.run(Thread.java:834)

prismofeverything · 2019-07-10T22:36:39Z

Interesting... this happens during archiving. I have a feature branch that no longer archives but instead uploads each file, so this would probably not happen in that case.

That said, 21k files per sim is a lot! Maybe too many. Is it possible to refactor the causality network task to emit a single archive of the 21k json files, which the CN then opens? Object stores work best with fewer, larger files rather than multitudes of small files like this.

To solve this in the immediate term we can always raise the open file limit (here is a dumb tutorial for this): https://easyengine.io/tutorials/linux/increase-open-files-limit/
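
As a rough illustration of what "the open file limit" means per process, here is a minimal Python sketch that inspects and raises the soft `RLIMIT_NOFILE` limit using the standard `resource` module (Unix only). The helper name `raise_open_file_limit` is hypothetical. The Sisyphus worker is a JVM process, so this is not how its limit would actually be set; that would presumably happen in whatever launches it, e.g. `ulimit -n` in the shell or `LimitNOFILE=` in a systemd unit.

```python
import resource


def raise_open_file_limit(target):
    """Raise this process's soft open-file limit toward `target`.

    Returns the soft limit in effect afterwards. Hypothetical helper,
    for illustration only.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process cannot raise the soft limit above the hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)

    # Nothing to do if the soft limit is already at least as high as requested.
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft

    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Raising the limit only postpones the failure if the file count keeps growing, which is why the thread treats it as a stopgap.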

1fish2 · 2019-07-10T22:46:21Z

Storing individual files will be great.

If we want to keep the Sisyphus archive feature, is this just a matter of closing each file after adding it to the archive rather than accumulating open files?
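
To make the "close each file after adding it" idea concrete, here is a minimal Python sketch using the standard `tarfile` module. It is not the Sisyphus implementation (that code is Clojure, in `archive.clj`); the function name `pack_directory` is hypothetical. The point is only that each input is opened, copied into the archive, and closed before the next one is touched.

```python
import os
import tarfile


def pack_directory(src_dir, archive_path):
    """Pack every regular file under src_dir into a gzipped tar archive.

    Returns the number of files added. archive_path should live outside
    src_dir so the archive does not try to include itself.
    """
    count = 0
    with tarfile.open(archive_path, "w:gz") as archive:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                arcname = os.path.relpath(path, src_dir)
                info = archive.gettarinfo(path, arcname=arcname)
                # Open, copy, and close each input before moving on, so at
                # most one input file descriptor is open at any time.
                with open(path, "rb") as stream:
                    archive.addfile(info, stream)
                count += 1
    return count
```

With this shape the number of open descriptors stays constant no matter how many json files the task writes.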

Maybe the Causality builder should put all the json files into an archive, but they might have to get unpacked for the Causality viewer's web page. I suspect the json files are organized so the viewer can open them incrementally. That's probably related to the viewer's limitations that keep it from fitting more than ≈2 generations in memory.
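
One way to get a single file while keeping incremental access is a zip archive, since its central directory lets a reader pull out one member without unpacking the rest. The following is a sketch under assumptions: the function names and the `<series_id>.json` member naming are hypothetical, and nothing here reflects how the Causality builder or viewer actually store their data.

```python
import json
import zipfile


def write_series_archive(archive_path, series_by_id):
    """Write each series as its own JSON member of a single zip file."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for series_id, series in series_by_id.items():
            archive.writestr("%s.json" % series_id, json.dumps(series))


def read_series(archive_path, series_id):
    """Load one series by id without extracting the other members."""
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open("%s.json" % series_id) as member:
            return json.loads(member.read().decode("utf-8"))
```

A gzipped tar would also give one file, but reading a single member from it generally means scanning through the stream, so it suits archive-then-unpack better than on-demand reads.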

`wcm.py` is like `fw_queue.py` but it takes inputs from an argparse CLI, builds a Gaia workflow, launches a suitable number of Sisyphus workers, then submits the workflow to the Gaia server. The workers will time out afterwards.

The steps to use it will get simpler and better documented. At the moment:

1. Install `gcloud` and authenticate (one time setup).
2. Build the wcEcoli "runtime" container image if the pip requirements changed: `cloud/build-runtime.sh`
3. Build your own wcEcoli "wcm" code container image if you changed the code in your workspace (specifically the firetasks and the code they call): `cloud/build-wcm.sh`
   * It will be named `$USER-wcm-code` by default. You can give it an `ID` argument if you want more than one container image.
4. ssh to the Gaia server, opening a tunnel to its Gaia server port and (for now) to the Kafka cluster: `runscripts/sisyphus/ssh-tunnel.sh` (See that shell script for another setup step, soon to be obsolete.)
5. In another terminal tab, with the wcEcoli directory on the `PYTHONPATH`, run the workflow builder, e.g.: `python runscripts/sisyphus/wcm.py -g2 -c2`.
   * The option `-g2` means 2 generations, and `-c2` requests 2 CPUs on each worker node which matches their current configuration. `-c2` will run the parallel Parca and it might speed up the analyses.
   * Each worker node runs one task at a time.
   * The workflow builder launches `variant_count * init_sims` workers by default. The `--workers` argument overrides that.
   * Don't use `--build_causality_network` yet because it will [break the current Sisyphus code](CovertLab/sisyphus#3).

   ```
   usage: wcm.py [-h] [--verbose [VERBOSE] | --no_verbose] [-c CPUS]
                 [--dump [DUMP] | --no_dump] [-w WORKERS]
                 [--ribosome_fitting [RIBOSOME_FITTING] | --no_ribosome_fitting]
                 [--rnapoly_fitting [RNAPOLY_FITTING] | --no_rnapoly_fitting]
                 [--debug_parca [DEBUG_PARCA] | --no_debug_parca]
                 [-v VARIANT_TYPE FIRST_INDEX LAST_INDEX] [-g GENERATIONS]
                 [-i INIT_SIMS] [-t TIMELINE] [--length_sec LENGTH_SEC]
                 [--timestep_safety_frac TIMESTEP_SAFETY_FRAC]
                 [--timestep_max TIMESTEP_MAX]
                 [--timestep_update_freq TIMESTEP_UPDATE_FREQ]
                 [--mass_distribution [MASS_DISTRIBUTION] | --no_mass_distribution]
                 [--growth_rate_noise [GROWTH_RATE_NOISE] | --no_growth_rate_noise]
                 [--d_period_division [D_PERIOD_DIVISION] | --no_d_period_division]
                 [--translation_supply [TRANSLATION_SUPPLY] | --no_translation_supply]
                 [--trna_charging [TRNA_CHARGING] | --no_trna_charging]
                 [--run_analysis [RUN_ANALYSIS] | --no_run_analysis]
                 [-p PLOT [PLOT ...]]
                 [--build_causality_network [BUILD_CAUSALITY_NETWORK] | --no_build_causality_network]
   ```
6. Watch the logs via the [GCP Logs Viewer](https://console.cloud.google.com/logs/viewer?resource=gce_instance&project=allen-discovery-center-mcovert&organizationId=302681460499&minLogLevel=0&expandAll=false&timestamp=2019-07-10T22:47:58.497000000Z&customFacets=&limitCustomFacetWidth=true&interval=PT1H&scrollTimestamp=2019-07-10T21:55:14.740000000Z&dateRangeStart=2019-07-10T21:47:58.496Z&dateRangeUnbound=forwardInTime) set to "GCE VM Instance".
7. Download the outputs via the [Google Cloud Storage browser](https://console.cloud.google.com/storage/browser/sisyphus/data/?project=allen-discovery-center-mcovert&organizationId=302681460499) or (soon) by mounting it via gcsfuse.

Additional code changes:

* Each firetask is now responsible for creating its output directories since the tasks have access to the correct file system while the builder does not.
* Factor out a subroutine to name the output variant dirs rather than replicate that.
* Add a ParcaTask firetask that bundles the 4 existing firetasks into one that fits Sisyphus' functional model where inputs and outputs are distinct files and directories.
* BuildCausalityNetworkTask also breaks the functional model by treating `output_network_directory` as an output once per variant, otherwise as an input. The builder works around that by asking it to write its network and dynamics into the same directory. That does mean recomputing the network per sim rather than sharing it per variant, which we could optimize by moving the network part of the builder to VariantSimDataTask, but in practice it would save very little space and time compared to the rest of its work.
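
The default worker count mentioned in step 5 (`variant_count * init_sims`, overridden by `--workers`) could be computed along these lines. This is a hypothetical sketch, not the code in `wcm.py`: the long option names `--variant` and `--init_sims`, the defaults, and the assumption that `variant_count` is `LAST_INDEX - FIRST_INDEX + 1` are all guesses made for illustration.

```python
import argparse


def build_parser():
    """Build a tiny stand-in parser with only the options needed here."""
    parser = argparse.ArgumentParser(prog="wcm.py")
    parser.add_argument(
        "-v", "--variant", nargs=3,
        metavar=("VARIANT_TYPE", "FIRST_INDEX", "LAST_INDEX"),
        default=["wildtype", "0", "0"])  # assumed default
    parser.add_argument("-i", "--init_sims", type=int, default=1)
    parser.add_argument("-w", "--workers", type=int, default=None)
    return parser


def worker_count(args):
    """Return the explicit --workers value, else variant_count * init_sims."""
    if args.workers is not None:
        return args.workers
    first, last = int(args.variant[1]), int(args.variant[2])
    variant_count = last - first + 1  # assumes an inclusive index range
    return variant_count * args.init_sims
```

For example, under these assumptions `-v condition 0 2 -i 4` would request 12 workers, while adding `-w 5` would cap the launch at 5.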

1fish2 · 2019-07-24T05:19:31Z

See CovertLab/wcEcoli#605

1fish2 added the bug label Jul 10, 2019

1fish2 assigned prismofeverything Jul 10, 2019

This was referenced Jul 10, 2019

WCM workflow to run on Gaia + Sisyphus CovertLab/wcEcoli#591

Closed

Gaia workflow CovertLab/wcEcoli#592

Merged

tahorst mentioned this issue Jul 11, 2019

Too many open files CovertLab/wcEcoli#593

Closed

1fish2 mentioned this issue Jul 24, 2019

make BuildCausalityNetworkTask write one compact file CovertLab/wcEcoli#605

Closed
