Reducing the data footprint for large simulations #1141

charliejharrison · 2021-08-06T16:07:57Z

We would like to explore what happens to the model over large numbers of generations, spanning upwards of 10⁶ cells. To make this feasible we would need to drastically reduce the amount of data that is saved to disk, removing most of the tables and analysis runs. Data serialisation currently involves a number of classes (Listeners, Loggers, Processes) and happens in a few places, and I'm not sure whether the model relies on any of these for calculations or whether they can be safely excised.

Can data serialisation be made more configurable?

tahorst · 2021-08-11T13:27:45Z

This would be a great feature to add! I'm copying my original response to your email below for completeness on this thread (I hope it was enough to get you going) and listing some proposals on how to address this in a better way moving forward.

Current way of doing it manually:
I think if you comment out any listeners listed here that you don't want, they should not be written to disk. There's a chance that some processes depend on listener data while a sim is running, and you won't be able to remove those listeners. (We try to limit this as much as possible, so that processes and states are the only things needed during simulation and listeners are the interface with disk.) Still, this should be the simplest way of doing it. Logging to disk is set up in the simulation.py code here; this function is what actually sets the files that will be written (you can see that internal_states like BulkMolecules and all listeners are here), and I think it writes to disk with this function call after every timestep. If you don't want to save bulk molecules in your sims, you will need to remove it from createTables.

Proposal for making more configurable:

- Accept a config file that replaces the configuration in simulation.py. We can have a default config file that behaves the same way, but optionally accept a different file for workflows and manual scripts to reduce the number of listeners used.
- Remove simulation dependence on listeners. There are a few places in sims that read values from listeners, but these could be made into states so that processes and states communicate during simulation and listeners are a one-way route to disk (CellProperties state #512, Redundant volume/concentration calculations #80).
- Specify the listeners that get written to disk in the config file and pass it to this function instead of defaulting to the classes that are chained together.
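As a rough illustration of the config-file idea, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the JSON layout, the `listeners` key, the `load_listener_config` helper, and the `ListenerA`/`ListenerB` placeholder names are hypothetical and do not exist in the repository. Only "Mass" and "EvaluationTime" are names taken from this thread.

```python
import json

# Placeholder registry. "Mass" and "EvaluationTime" are mentioned in this thread;
# "ListenerA" and "ListenerB" are made-up stand-ins for the remaining listeners.
ALL_LISTENERS = ["Mass", "EvaluationTime", "ListenerA", "ListenerB"]


def load_listener_config(text=None):
    """Return the listener names to write to disk (hypothetical helper).

    With no config text, fall back to every listener, which mirrors the
    "default config behaves the same way" part of the proposal.
    """
    if text is None:
        return list(ALL_LISTENERS)

    config = json.loads(text)
    requested = config.get("listeners", ALL_LISTENERS)

    # Fail loudly on typos instead of silently dropping a table.
    unknown = [name for name in requested if name not in ALL_LISTENERS]
    if unknown:
        raise ValueError("Unknown listeners in config: %s" % ", ".join(unknown))

    # Keep the registry's order so output is stable regardless of config order.
    return [name for name in ALL_LISTENERS if name in requested]


default_listeners = load_listener_config()
reduced_listeners = load_listener_config('{"listeners": ["EvaluationTime", "Mass"]}')
```

A workflow or manual script could then pass a reduced file for long runs while everything else keeps the default behavior.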

I would love to get your thoughts on this @1fish2 (once you're back from your roadtrip!)

1fish2 · 2021-08-21T00:42:38Z

It looks like we could turn off all the output tables by setting the option logToDisk to False when constructing the simulation and simulationDaughter Firetasks. (wholecell/sim/simulation.py copies this option into the Simulation._logToDisk attribute -- it's tricky). All the listeners would still run in memory but none of the listeners, internal_states, or external_states would write to disk.

While you're at it, set logToShell to False to turn off most of the console output and tweak divide_cell.divide_cell() to only write one of the two daughter cell inherited state files.
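To make the effect of the two flags concrete, here is a small stand-in sketch. It is not the real `Simulation` class: `SketchSimulation` and `logger_names` are hypothetical, and the "Shell" logger name is an assumption (only `Disk`, `logToDisk`, `logToShell`, and `_logToDisk` appear in this thread). The point is only that listeners keep running in memory while the loggers are switched off.

```python
class SketchSimulation:
    """Hypothetical stand-in showing how the two options gate the loggers."""

    def __init__(self, logToDisk=True, logToShell=True):
        # The thread notes that the option is copied into a private attribute.
        self._logToDisk = logToDisk
        self._logToShell = logToShell
        # Listeners still exist and update in memory either way.
        self.listeners = {"Mass": {}, "EvaluationTime": {}}

    def logger_names(self):
        """Return the loggers that would be attached for this run."""
        names = []
        if self._logToShell:
            names.append("Shell")  # assumed name for the console logger
        if self._logToDisk:
            names.append("Disk")
        return names


quiet_sim = SketchSimulation(logToDisk=False, logToShell=False)
default_sim = SketchSimulation()
```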

Removing classes from _listenerClasses would save additional in-memory work, but as @tahorst noted, the code expects to:

- read from the "Mass" listener: readFromListener("Mass", "cellMass"), self.listeners['Mass'].volume
- write to the "EvaluationTime" listener: self.listeners["EvaluationTime"]
- write to "Main" per special-casing in Disk; it is not even listed in _listenerClasses.

Yes, it'd be cleaner to turn those listeners into states.

To make this configurable, @tahorst's idea of removing listeners from _listenerClasses sounds good for all but "Mass", "EvaluationTime", and "Main". To keep specific listeners (including those 3) and internal/external states from writing to disk, Disk.createTables() could filter them out, again with special-casing for "Main".
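The filtering step could look roughly like the sketch below. It is not the actual `Disk.createTables()` code: `tables_to_write` and its `skip` parameter are hypothetical, and it assumes listeners and states are available as name-keyed dicts. "Main" is handled separately because, as noted above, it is not in the listener collection.

```python
def tables_to_write(listeners, internal_states, external_states, skip=frozenset()):
    """Return the table names that should be created on disk (hypothetical helper).

    `skip` holds names that stay in memory but are not written.
    """
    names = []

    # "Main" is special-cased: it is not part of the listeners dict,
    # so it needs its own check rather than falling out of the loop below.
    if "Main" not in skip:
        names.append("Main")

    for group in (internal_states, external_states, listeners):
        names.extend(name for name in group if name not in skip)
    return names


kept_tables = tables_to_write(
    listeners={"Mass": object(), "EvaluationTime": object()},
    internal_states={"BulkMolecules": object()},
    external_states={},
    skip={"BulkMolecules", "Main"},
)
```

Because the listeners themselves are untouched, code that reads "Mass" or writes "EvaluationTime" during the simulation would keep working; only the disk output shrinks.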

BTW, all this is easy in vivarium-ecoli: Just configure Store variables to have _emit = False or let them default to False per the Vivarium framework.
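For comparison, a Vivarium-style schema marks each variable with an `_emit` flag. The sketch below uses a plain dict and does not import Vivarium. The variable names and the `emitted_paths` helper are made up for illustration and are not part of the framework; the helper only mimics the "default to False" behavior described above.

```python
# Illustrative schema: only cell_mass is flagged for emission.
schema = {
    "listeners": {
        "mass": {
            "cell_mass": {"_default": 0.0, "_emit": True},
            "dry_mass": {"_default": 0.0},  # no _emit key, so treated as False
        },
    },
    "bulk": {
        "counts": {"_default": 0, "_emit": False},
    },
}


def emitted_paths(node, prefix=()):
    """Collect the paths of variables whose schema sets _emit to True."""
    paths = []
    for key, value in node.items():
        if key.startswith("_") or not isinstance(value, dict):
            continue
        if any(k.startswith("_") for k in value):
            # A dict with underscore keys is a variable (leaf) declaration.
            if value.get("_emit", False):
                paths.append(prefix + (key,))
        else:
            paths.extend(emitted_paths(value, prefix + (key,)))
    return paths


paths_to_emit = emitted_paths(schema)
```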

tahorst added the enhancement label Aug 11, 2021
