BuildCausalityNetworkTask --> Too many open files #3

Open
1fish2 opened this issue Jul 10, 2019 · 3 comments

@1fish2
Collaborator

1fish2 commented Jul 10, 2019

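`BuildCausalityNetworkTask` writes 29100 JSON files into a sim generation's `seriesOut/` dir, leading to a `FileNotFoundException` (Too many open files) when Sisyphus archives the task outputs.

That's immediately followed by a `java.lang.ClassCastException: class java.lang.Character cannot be cast to class java.util.Map$Entry`, which might be an error handling bug: the second trace goes through `exception!` and `send!` into `merge`, so the handler may be merging a string where it expects a map.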

Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]: java.io.FileNotFoundException: /tmp/sisyphus/outputs/data/jerry/20190710.140901/wildtype_000000/000000/generation_000000/000000/seriesOut/8699366258568945.json (Too many open files)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.io.FileInputStream.open0(Native Method)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.java.io$fn__11466.invokeStatic(io.clj:229)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.java.io$fn__11466.invoke(io.clj:229)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.java.io$fn__11379$G__11372__11386.invoke(io.clj:69)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.java.io$input_stream.invokeStatic(io.clj:136)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.java.io$input_stream.doInvoke(io.clj:121)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.lang.RestFn.invoke(RestFn.java:410)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.archive$pack_BANG_.invokeStatic(archive.clj:71)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.archive$pack_BANG_.invoke(archive.clj:58)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$push_output_BANG_.invokeStatic(task.clj:64)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$push_output_BANG_.invoke(task.clj:60)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$perform_task_BANG_.invokeStatic(task.clj:195)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$perform_task_BANG_.invoke(task.clj:139)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.core$sisyphus_handle_rabbit.invokeStatic(core.clj:109)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.core$sisyphus_handle_rabbit.invoke(core.clj:100)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$partial$fn__5824.invoke(core.clj:2626)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at langohr.consumers$create_default$fn__18445.invoke(consumers.clj:83)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at langohr.consumers.proxy$com.rabbitmq.client.DefaultConsumer$ff19274a.handleDelivery(Unknown Source)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:104)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.lang.Thread.run(Thread.java:834)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]: SEVERE sisyphus: rabbit-task
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]: java.lang.ClassCastException: class java.lang.Character cannot be cast to class java.util.Map$Entry (java.lang.Character and java.util.Map$Entry are in module java.base of loader 'bootstrap')
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.lang.APersistentMap.cons(APersistentMap.java:42)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.lang.RT.conj(RT.java:673)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$conj__5375.invokeStatic(core.clj:85)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$merge$fn__5943.invoke(core.clj:3049)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$reduce1.invokeStatic(core.clj:944)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$reduce1.invokeStatic(core.clj:934)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$merge.invokeStatic(core.clj:3048)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$merge.doInvoke(core.clj:3041)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.lang.RestFn.invoke(RestFn.java:421)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$send_BANG_.invokeStatic(task.clj:113)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$send_BANG_.invoke(task.clj:106)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$exception_BANG_.invokeStatic(task.clj:137)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$exception_BANG_.invoke(task.clj:134)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$perform_task_BANG_.invokeStatic(task.clj:209)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.task$perform_task_BANG_.invoke(task.clj:139)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.core$sisyphus_handle_rabbit.invokeStatic(core.clj:109)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at sisyphus.core$sisyphus_handle_rabbit.invoke(core.clj:100)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at clojure.core$partial$fn__5824.invoke(core.clj:2626)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at langohr.consumers$create_default$fn__18445.invoke(consumers.clj:83)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at langohr.consumers.proxy$com.rabbitmq.client.DefaultConsumer$ff19274a.handleDelivery(Unknown Source)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:104)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
Jul 10 21:55:14 sisyphus-jerry-0 bash[1251]:         at java.base/java.lang.Thread.run(Thread.java:834)
1fish2 added the `bug` label (Something isn't working) Jul 10, 2019
@prismofeverything
Member

Interesting... this happens during archiving. I have a feature branch that no longer archives but instead uploads each file, so this would probably not happen in that case.

That said, 21k files per sim is a lot! Maybe too many. Is it possible to refactor the causality network task to emit a single archive of the 21k JSON files, which the CN then opens? Object stores work best with fewer, larger files rather than multitudes of small files like this.

To solve this in the immediate term we can always raise the open file limit (here is a dumb tutorial for this): https://easyengine.io/tutorials/linux/increase-open-files-limit/
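
For reference, here is a minimal sketch of how a process can inspect and raise its own soft open-file limit on a Unix host, using Python's standard `resource` module. The function name `raise_open_file_limit` and the default target of 65536 are made up for illustration. Sisyphus itself runs on the JVM, so its limit would come from whatever launches it (shell `ulimit`, service configuration, etc.) rather than from code like this; the sketch only shows the mechanism.

```python
import resource


def raise_open_file_limit(target=65536):
    """Raise this process's soft RLIMIT_NOFILE toward `target`.

    The soft limit is never lowered and never set above the hard limit.
    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do

    # Cap the request at the hard limit unless the hard limit is unlimited.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if wanted > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            # Some platforms reject values the hard limit nominally allows;
            # in that case keep the existing limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Raising the limit only postpones the problem if the file count keeps growing, so it is a stopgap rather than a fix.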

@1fish2
Collaborator Author

1fish2 commented Jul 10, 2019

Storing individual files will be great.

If we want to keep the Sisyphus archive feature, is this just a matter of closing each file after adding it to the archive rather than accumulating open files?
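
To illustrate the idea (this is not the Sisyphus code, which is Clojure), here is a Python sketch of packing a directory so that at most one input file is open at any time. The function name `pack_directory` and the choice of a gzipped tar are assumptions for the example.

```python
import os
import tarfile


def pack_directory(src_dir, archive_path):
    """Pack every file under src_dir into a gzipped tar archive.

    Each input file is opened, copied into the archive, and closed before
    the next one is opened, so the number of open file descriptors stays
    constant no matter how many files the directory holds.
    archive_path should live outside src_dir.
    Returns the number of files packed.
    """
    count = 0
    with tarfile.open(archive_path, "w:gz") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                arcname = os.path.relpath(path, src_dir)
                info = tar.gettarinfo(path, arcname=arcname)
                # The `with` block closes this input as soon as it is copied.
                with open(path, "rb") as f:
                    tar.addfile(info, f)
                count += 1
    return count
```

The point is the inner `with`: opening all the input streams up front and closing them only after the archive is finished is what would exhaust the descriptor limit with tens of thousands of files.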

Maybe the Causality builder should put all the JSON files into an archive, but they might have to get unpacked for the Causality viewer's web page. I suspect the JSON files are organized for the viewer to open incrementally. That's probably related to the viewer's limitations that keep it from being able to fit more than ≈2 generations in memory.
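
A sketch of one way to get both properties, a single output file and incremental access, assuming a zip archive (which keeps a per-member index) would be acceptable. The helper names `write_series_archive` and `read_series` are hypothetical, and whether the web viewer could consume a zip directly is not established here; this only shows the Python side.

```python
import json
import zipfile


def write_series_archive(archive_path, series_by_name):
    """Write each series as its own JSON member of a single zip file."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, series in series_by_name.items():
            zf.writestr(name + ".json", json.dumps(series))


def read_series(archive_path, name):
    """Load one series by name without unpacking the rest of the archive."""
    with zipfile.ZipFile(archive_path) as zf:
        with zf.open(name + ".json") as member:
            return json.loads(member.read().decode("utf-8"))
```

A reader only decompresses the members it asks for, so memory use stays proportional to the series actually opened.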

1fish2 added a commit to CovertLab/wcEcoli that referenced this issue Jul 13, 2019
`wcm.py` is like `fw_queue.py` but it takes inputs from an argparse CLI, builds a Gaia workflow, launches a suitable number of Sisyphus workers, then submits the workflow to the Gaia server. The workers will time out afterwards.

The steps to use it will get simpler and better documented. At the moment:

1. Install `gcloud` and authenticate (one time setup).
2. Build the wcEcoli "runtime" container image if the pip requirements changed: `cloud/build-runtime.sh`
3. Build your own wcEcoli "wcm" code container image if you changed the code in your workspace (specifically the firetasks and the code they call): `cloud/build-wcm.sh`
   * It will be named `$USER-wcm-code` by default. You can give it an `ID` argument if you want more than one container image.
4. ssh to the Gaia server, opening a tunnel to its Gaia server port and (for now) to the Kafka cluster: `runscripts/sisyphus/ssh-tunnel.sh`
   (See that shell script for another setup step, soon to be obsolete.)
5. In another terminal tab, with the wcEcoli directory on the `PYTHONPATH`, run the workflow builder, e.g.: `python runscripts/sisyphus/wcm.py -g2 -c2`.
   * The option `-g2` means 2 generations, and `-c2` requests 2 CPUs on each worker node which matches their current configuration. `-c2` will run the parallel Parca and it might speed up the analyses.
   * Each worker node runs one task at a time.
   * The workflow builder launches `variant_count * init_sims` workers by default. The `--workers` argument overrides that.
   * Don't use `--build_causality_network` yet because it will [break the current Sisyphus code](CovertLab/sisyphus#3).
   ```
   usage: wcm.py [-h] [--verbose [VERBOSE] | --no_verbose] [-c CPUS]
                 [--dump [DUMP] | --no_dump] [-w WORKERS]
                 [--ribosome_fitting [RIBOSOME_FITTING] | --no_ribosome_fitting]
                 [--rnapoly_fitting [RNAPOLY_FITTING] | --no_rnapoly_fitting]
                 [--debug_parca [DEBUG_PARCA] | --no_debug_parca]
                 [-v VARIANT_TYPE FIRST_INDEX LAST_INDEX] [-g GENERATIONS]
                 [-i INIT_SIMS] [-t TIMELINE] [--length_sec LENGTH_SEC]
                 [--timestep_safety_frac TIMESTEP_SAFETY_FRAC]
                 [--timestep_max TIMESTEP_MAX]
                 [--timestep_update_freq TIMESTEP_UPDATE_FREQ]
                 [--mass_distribution [MASS_DISTRIBUTION] |
                 --no_mass_distribution] [--growth_rate_noise [GROWTH_RATE_NOISE]
                 | --no_growth_rate_noise]
                 [--d_period_division [D_PERIOD_DIVISION] |
                 --no_d_period_division]
                 [--translation_supply [TRANSLATION_SUPPLY] |
                 --no_translation_supply] [--trna_charging [TRNA_CHARGING] |
                 --no_trna_charging] [--run_analysis [RUN_ANALYSIS] |
                 --no_run_analysis] [-p PLOT [PLOT ...]]
                 [--build_causality_network [BUILD_CAUSALITY_NETWORK] |
                 --no_build_causality_network]
   ```
6. Watch the logs via the [GCP Logs Viewer](https://console.cloud.google.com/logs/viewer?resource=gce_instance&project=allen-discovery-center-mcovert&organizationId=302681460499&minLogLevel=0&expandAll=false&timestamp=2019-07-10T22:47:58.497000000Z&customFacets=&limitCustomFacetWidth=true&interval=PT1H&scrollTimestamp=2019-07-10T21:55:14.740000000Z&dateRangeStart=2019-07-10T21:47:58.496Z&dateRangeUnbound=forwardInTime) set to "GCE VM Instance".
7. Download the outputs via the [Google Cloud Storage browser](https://console.cloud.google.com/storage/browser/sisyphus/data/?project=allen-discovery-center-mcovert&organizationId=302681460499) or (soon) by mounting it via gcsfuse.
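
The default worker count described in step 5 can be sketched as follows. This is an illustration rather than the code in `wcm.py`: the function name `worker_count` is hypothetical, and it assumes the variant range `FIRST_INDEX..LAST_INDEX` is inclusive.

```python
def worker_count(first_index, last_index, init_sims, workers=None):
    """Return how many workers to launch.

    An explicit `workers` value (the --workers argument) wins; otherwise
    the default is variant_count * init_sims.
    """
    if workers is not None:
        return workers
    # Assumption: the variant index range is inclusive at both ends.
    variant_count = last_index - first_index + 1
    return variant_count * init_sims
```

Under that assumption, variants 0 through 2 with 4 initial sims each would default to 12 workers, while passing `workers=5` would launch 5.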

Additional code changes:
* Each firetask is now responsible for creating its output directories since the tasks have access to the correct file system while the builder does not.
* Factor out a subroutine to name the output variant dirs rather than replicate that.
* Add a ParcaTask firetask that bundles the 4 existing firetasks into one that fits Sisyphus' functional model where inputs and outputs are distinct files and directories.
* BuildCausalityNetworkTask also breaks the functional model by treating `output_network_directory` as an output once per variant, otherwise as an input. The builder works around that by asking it to write its network and dynamics into the same directory. That does mean recomputing the network per sim rather than sharing it per variant, which we could optimize by moving the network part of the builder to VariantSimDataTask, but in practice it would save very little space and time compared to the rest of its work.
@1fish2
Collaborator Author

1fish2 commented Jul 24, 2019

See CovertLab/wcEcoli#605
