Issue writing preprocessed cube for CESM2-WACCM dataset #3738
Comments
@valeriupredoi - can you please look into what I might be doing wrong when you get a chance? Thanks!
@rswamina could you please post the entire trace or the debug file? Not much I can gauge from that bit of output 😁
Yes, of course. The whole main_log_debug.txt file is attached above. Let me know if you cannot access it.
If all preprocessor functions are lazy, the actual computation only happens when the output is saved. The timestamps in the attached debug log do not match those in the top post.
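For illustration, a minimal sketch (with synthetic data, nothing from the recipe) of why a lazy pipeline only does its heavy lifting when the result is written to disk:

```python
# Synthetic example: lazy operations only build a task graph; the graph is
# evaluated when the cube is streamed to disk by iris.save.
import dask.array as da
import iris
from iris.cube import Cube

cube = Cube(da.random.random((200, 200, 200), chunks=(50, 200, 200)),
            var_name="tas")
print(cube.has_lazy_data())         # True: nothing has been computed yet
iris.save(cube, "lazy_example.nc")  # all the deferred work happens here
```

So a run that appears to hang at the save step may simply be computing everything that was deferred earlier.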
I reran it to generate a fresh debug file. Please consider only the timestamps in that file. I waited around 30 minutes and then killed the job. How long should I wait?
I should add that I tried this several times wondering if there was an issue with JASMIN. The longest I waited (not sure if it was this particular run) was 30 minutes.
I tried running the recipe on my laptop with just the first dataset
It completes in about 6 seconds and uses 2GB of RAM with ESMValCore v2.11.0 and 9 seconds / 4GB of RAM with v2.10.0. Maybe it is something specific to Jasmin indeed. Can you access the files that are listed in the top post, e.g. with
I had a similar problem with other data that originated from using dask distributed with the
From the debug log it looks like @rswamina is using the default scheduler (so no distributed).
I am using the default scheduler. On JASMIN, SSP245 has just the r1 ensemble member's data. It seems to hang for both historical and ssp245 experiments. I tried an
Am travelling back to the UK today, will have a look tomorrow or on Thu. Before I do that though - Ranjini, could you pls confirm there is enough room on the disk you're writing to, and what iris and esmvalcore versions you're using? Cheers 🍺
Thanks @valeriupredoi. I have enough space on disk. I am using the
I am not sure how to check the iris version though.
Thanks Ranjini! I know how to do that 😁 Speaking of, I should install 2.11 sooner rather than later
Hi @valeriupredoi - can you let me know if you diagnosed the problem? If this might take time, I will exclude the data set for now. Thanks!
on it, will have a look today, apols for the delay, too many emails to check after holidays 🤣
hi Ranjini, am currently running that recipe, a few points:
@valeriupredoi - Thanks. I am outputting to
Are writes to scratch slower, or, as you indicate, is this still a processing delay? I wonder why that is; it is not very much data.
My guess is it's not the data but rather that those region selections are intensive. Have left my run on overnight, see how long it takes 🍺
I am curious as to why. This worked quite quickly for @bouweandela on a different machine and the same preprocessing worked quickly again for other model data sets (same time period but perhaps different resolutions). I have successfully run this preprocessor on JASMIN before for the same data set but I can't remember if I just let it run overnight.
OK, starting to make sense of this: the save is indeed waiting on a fairly large number of still running tasks (see the wait statement in
In fact, after a fair while, the
So, what does this mean?
found it! This code block trips it:

```python
for cube in cubes:
    logger.debug("Saving cube:\n%s\nwith %s data to %s", cube,
                 "lazy" if cube.has_lazy_data() else "realized", filename)
```

Unbelievably sneaky!
@bouweandela have a look at this - simplifying for debugging that code block above:

```python
for cube in cubes:
    if cube.has_lazy_data():
        data_state = "lazy"
    else:
        data_state = "realized"
    logger.debug("Saving cube: %s with %s data to %s", cube, data_state, filename)
```

the whole run hangs; commenting out the debug logger line:

```python
for cube in cubes:
    if cube.has_lazy_data():
        data_state = "lazy"
    else:
        data_state = "realized"
    # logger.debug("Saving cube: %s with %s data to %s", cube, data_state, filename)
```

the run goes through at light speed! Something is really wonky at JASMIN with debug StringIO
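To confirm where the time goes, a rough, hypothetical check (the file path below is a placeholder, not from the recipe) is to time the string formatting on its own: the `%s` applied to a cube calls `Cube.__str__`, which builds the full cube summary.

```python
# Hypothetical timing check: measure how long the formatting performed by
# logger.debug("... %s ...", cube) actually takes, independently of logging.
import time

import iris

cube = iris.load_cube("/path/to/problematic_preproc_file.nc")  # placeholder path
start = time.perf_counter()
text = "%s" % cube  # the same formatting the debug call triggers (Cube.__str__)
print(f"Formatting the cube summary took {time.perf_counter() - start:.1f} s "
      f"({len(text)} characters)")
```

If that single formatting step already takes a very long time on the JASMIN copy of the data, the debug call itself is the bottleneck.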
@rswamina you can now run the recipe no problem, I've taken out the offending bit - it is the issue with
@valeriupredoi - many thanks for this!!! I will let you know if I have further issues.
So why does this not happen for other data sets?
that is a one-million-dollar question! Here is a model code that mimics our logging mechanism:

```python
import logging
import logging.config
import time
from pathlib import Path

import iris
import yaml

import esmvalcore

logger = logging.getLogger(__name__)


def _update_stream_level(cfg: dict, level=None):
    """Update the log level for the stream handlers."""
    handlers = cfg['handlers']
    for handler in handlers.values():
        if level is not None and 'stream' in handler:
            if handler['stream'] in ('ext://sys.stdout', 'ext://sys.stderr'):
                handler['level'] = level.upper()


def configure_logging(cfg_file: str = None,
                      output_dir: str = None,
                      console_log_level: str = None) -> list:
    if cfg_file is None:
        # locate config-logging.yml inside the installed esmvalcore package
        cfg_loc = Path(esmvalcore.__file__ + "esmvalcore")
        cfg_file = cfg_loc.parents[0] / 'config' / 'config-logging.yml'
    cfg_file = Path(cfg_file).absolute()
    with open(cfg_file) as file_handler:
        cfg = yaml.safe_load(file_handler)
    if output_dir is None:
        # _purge_file_handlers comes from ESMValCore; this branch is not
        # reached here because loaddebug() always passes an output_dir.
        _purge_file_handlers(cfg)
    _update_stream_level(cfg, level=console_log_level)
    logging.config.dictConfig(cfg)
    logging.Formatter.converter = time.gmtime
    logging.captureWarnings(True)


def loaddebug():
    # configure logger
    run_dir = "."
    log_files = configure_logging(output_dir="testdebuglog",
                                  console_log_level="debug")
    c = iris.load_cube("testRS.nc")
    logger.debug("crap")
    logger.debug("cube %s", c)
    return c


if __name__ == "__main__":
    loaddebug()
```

maybe you can make some sense of it - I'll just open a PR and plop a
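Whatever the root cause on JASMIN turns out to be, a defensive pattern for debug messages that interpolate whole cubes is sketched below; the `save_cubes` helper is hypothetical and this is not necessarily the fix adopted in ESMValCore:

```python
import logging

import iris

logger = logging.getLogger(__name__)


def save_cubes(cubes, filename):
    """Hypothetical helper: save cubes while keeping the debug logging cheap."""
    for cube in cubes:
        data_state = "lazy" if cube.has_lazy_data() else "realized"
        # Only build the message when a handler will actually emit DEBUG
        # records, and log the one-line summary instead of the full cube,
        # which is far cheaper to format than str(cube).
        if logger.isEnabledFor(logging.DEBUG):
            logger.debug("Saving cube: %s with %s data to %s",
                         cube.summary(shorten=True), data_state, filename)
    iris.save(cubes, filename)
```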
I suspect this issue may be related to the (un)availability of parallel writes on most Jasmin filesystems (see here). In particular, I would recommend choosing a storage location that supports parallel writes for the esmvaltool Could you try using e.g.
that would be interesting to experiment with! But I don't think that's the case here, since if we don't include the
@bouweandela I just reintroduced the original debug logger with
Apologies, that was a typo. I used
so this particular hang is due to that logger issue with
@rswamina a bit of an update here would be most welcome pls - I'd like to get ESMValGroup/ESMValCore#2526 in and be done with this mystery 🍺
I still think this is a peculiarity related to some issue when using CESM2 on JASMIN. This is because I could not successfully extract multiple regions for multiple ensemble members and SSPs in one go without the script hanging again. It worked when I reduced the number of region extractions. Please bear with me for a few days and I will provide an example.
ESMValTool seems to have trouble writing a preprocessed cube after extracting IPCC regions for the CESM2-WACCM dataset. I don't know if the program hangs or is simply taking too long, but the log messages stop at the point where the cube is being saved under the preproc directory. I am using the esmvaltool module installed on JASMIN. As far as I can tell there are no typos! Here is the recipe:
Here are the final lines of the error message. After the last line shown, the program just waits and I don't know why it hangs there. This recipe works for other models: