Hist fixed Merge (For Discussion Only!!!) #190

Open
wants to merge 88 commits into
base: prep_release

Conversation

pgierz
Member

@pgierz pgierz commented Oct 13, 2021

Hi all,

I'd like to at some point get rid of this hist_fixed branch. It seems to be "long lived", and that's not really the point of a branch that only fixes something.

So, what actually needed to be changed to get the historical run to work? What of that is not yet in prep_release?

@dbarbi @mandresm @chrisdane @denizural @christian-stepanek @fernandadialzira

Please ping anyone I forgot.

Dirk Barbi and others added 30 commits November 25, 2020 10:54
Exception to add_<env_vars> added to the environment case checker
…ists in the experiment tree

If a run crashes, and the user has been working in the virtual environment, they needed to remember to reactivate the environment to continue the run. Now, this check happens for you automatically and the virtual environment under ``$BASE_DIR/$EXP_ID/.venv_esmtools`` is reused if it exists.
feat(virtual_env_builder): recycles the virtual env if one already exists in the experiment tree
allows to exit right away from venv question
…t_models

Hotfix/coupling fields different models
… section of a simulation. By default the only reusable files are 'bin' and 'src'. With this change it is possible to also reuse, for example, 'input'. Reused subfolders can be now copied correctly into the work directory (before it was making a mess)
Reusable_filetypes in config files
@chrisdane

Hi all

I am currently using this branch at 51d71e6 and it works fine to run an awicm-1.0-recom historical simulation.

Can I do anything?

Cheers,
Chris

@pgierz
Member Author

pgierz commented Oct 14, 2021

Hi @chrisdane,

the boys are at the CliDyn retreat today and tomorrow, so I'm the only guy running support. Can you try merging, and see how many of the conflicts are just dumb stuff like version numbers and whatnot? That would cut down on the conflicts.

@mandresm
Contributor

mandresm commented Oct 14, 2021

I've just realized that this is esm_runscripts and not esm_tools. So I can definitely say something about this: many of the lines are things that existed in release but either no longer exist or are written differently in prep_release. I can take care of this merge on Monday, as I have done most of the latest merges into prep_release and most of the work of accommodating the differences.

@mandresm
Contributor

mandresm commented Oct 15, 2021

So, I've checked the history of this branch and it consists of 2 commits: one made by @pgierz to add an additional functionality to the echam namelists, and one by @chrisdane to remove all the changes of the previous commit. So this branch in its current state is not different at all from what is included in release and prep_release, and in principle @chrisdane could be using release. The question for me is whether or not we want to finish including the feature @pgierz started in 5389c6d.

At a4080c5 you can find the actual merge of the hist_fixed branch into prep_release, right before it started to diverge with the last 2 commits by @pgierz and @chrisdane.

@pgierz
Member Author

pgierz commented Oct 15, 2021

The work for that commit is halfway done. @chrisdane found some inconsistencies; it's on my list.

@pgierz
Member Author

pgierz commented Oct 18, 2021

From email discussion earlier:

Hi Christian,

According to my latest information, Fernanda has a running Historical setup and is just waiting for the results. Fernanda, fingers crossed it all works :-)

I’m currently in the process of setting up automatic tests. These will run on any commit made to esm-tools. I would say we test that all scenarios at least find all the correct files and are able to run the first and second year (so, once a “cold start” and then the next restart). I am slightly grumbly about wasting electricity on things that should work, but I would rather we waste a little bit than spend weeks debugging problems for students and production runs for the senior scientists. What do you think?

Miguel, I may need to talk with you and Deniz briefly. We might need something that tells the sad runscript file to “wait” and give back the model exit code for the auto-tests to work. And we should also really consider renaming this sad file to a run file. It makes me sad whenever I look at it ;-)

Christian, can you post any other info directly on github? Then it is available for anyone looking into the discussion, sorry if the system did not set that up for you, perhaps I misconfigured something.
#190

All the best
Paul
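
The "wait and give back the model exit code" idea from the email above could look roughly like the following Python sketch. It is not esm_runscripts code: `run_segments` is a hypothetical helper, and the two placeholder commands only stand in for a cold start and the following restart.

```python
import subprocess
import sys


def run_segments(commands):
    """Run each command in order, blocking until it finishes.

    Returns 0 if every command succeeds, otherwise the exit code of the
    first command that fails (later commands are not started).
    """
    for command in commands:
        # subprocess.run waits for the child process to exit.
        code = subprocess.run(command).returncode
        if code != 0:
            return code
    return 0


if __name__ == "__main__":
    # Placeholders for the real runs (assumption, not actual model commands).
    cold_start = [sys.executable, "-c", "raise SystemExit(0)"]
    restart = [sys.executable, "-c", "raise SystemExit(3)"]
    print("exit code:", run_segments([cold_start, restart]))
```

An automatic test could then simply fail whenever the returned code is non-zero.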

@christian-stepanek

There are currently various conflicts - is this something that can be quickly solved?

juschrep@ollie1:/work/ollie/juschrep/AWIESM_Example5/pico-fesom/software/github.com/esm-tools/esm_tools$ git merge origin/hist_fixed
Removing namelists/echam/6.3.04p1/PI-CTRL/namelist.echam~HEAD_0
Removing namelists/echam/6.3.04p1/PI-CTRL/namelist.echam~HEAD
Removing namelists/echam/6.3.04p1/PALEO/namelist.echam~HEAD_0
Removing namelists/echam/6.3.04p1/PALEO/namelist.echam~HEAD
Removing namelists/echam/6.3.04p1/HIST/namelist.echam~HEAD_0
Removing namelists/echam/6.3.04p1/HIST/namelist.echam~HEAD
Removing namelists/echam/6.3.04p1/4CO2/namelist.echam~HEAD_0
Removing namelists/echam/6.3.04p1/4CO2/namelist.echam~HEAD
Removing namelists/echam/6.3.04p1/1percCO2/namelist.echam~HEAD_0
Removing namelists/echam/6.3.04p1/1percCO2/namelist.echam~HEAD
Removing namelists/echam/6.3.04p1/1950/namelist.echam~HEAD_0
Removing namelists/echam/6.3.04p1/1950/namelist.echam~HEAD
Removing namelists/echam/6.3.02p4/SCEN/namelist.echam~HEAD_0
Removing namelists/echam/6.3.02p4/SCEN/namelist.echam~HEAD
Removing namelists/echam/6.3.02p4/RCP85/namelist.echam~HEAD_0
Removing namelists/echam/6.3.02p4/RCP85/namelist.echam~HEAD
Removing namelists/echam/6.3.02p4/RCP45/namelist.echam~HEAD_0
Removing namelists/echam/6.3.02p4/RCP45/namelist.echam~HEAD
Removing namelists/echam/6.3.02p4/RCP26/namelist.echam~HEAD_0
Removing namelists/echam/6.3.02p4/RCP26/namelist.echam~HEAD
Removing namelists/echam/6.3.02p4/PI-CTRL/namelist.echam~HEAD_0
Removing namelists/echam/6.3.02p4/PI-CTRL/namelist.echam~HEAD
Removing namelists/echam/6.3.02p4/PALEO/namelist.echam~HEAD_0
Removing namelists/echam/6.3.02p4/PALEO/namelist.echam~HEAD
Removing namelists/echam/6.3.02p4/1990/namelist.echam~HEAD_0
Removing namelists/echam/6.3.02p4/1990/namelist.echam~HEAD
Removing namelists/echam/6.3.02p4/1950/namelist.echam~HEAD_0
Removing namelists/echam/6.3.02p4/1950/namelist.echam~HEAD
Removing namelists/echam/6.3.02p4/1850/namelist.echam~HEAD_0
Removing namelists/echam/6.3.02p4/1850/namelist.echam~HEAD
Auto-merging install.sh
CONFLICT (content): Merge conflict in install.sh
Auto-merging configs/setups/awicm/awicm-2.0.yaml
Auto-merging configs/components/fesom/fesom.yaml
CONFLICT (content): Merge conflict in configs/components/fesom/fesom.yaml
Auto-merging configs/components/fesom/fesom-2.0.yaml
CONFLICT (content): Merge conflict in configs/components/fesom/fesom-2.0.yaml
Auto-merging configs/components/echam/echam.yaml
Auto-merging configs/components/echam/echam.datasets.yaml
Automatic merge failed; fix conflicts and then commit the result.
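
To get a quick overview of which files actually conflict in output like the above, the `CONFLICT` lines can be filtered out. The following is a minimal Python sketch; `conflicted_files` is a hypothetical helper, and it only recognises lines of the form "CONFLICT (<kind>): Merge conflict in <path>" as they appear in this log, not other conflict messages git may print.

```python
import re

# Matches only the message shape seen in the merge log above.
CONFLICT_RE = re.compile(
    r"^CONFLICT \((?P<kind>[^)]+)\): Merge conflict in (?P<path>.+)$"
)


def conflicted_files(merge_output):
    """Return (kind, path) tuples for each matching CONFLICT line."""
    conflicts = []
    for line in merge_output.splitlines():
        match = CONFLICT_RE.match(line.strip())
        if match:
            conflicts.append((match.group("kind"), match.group("path")))
    return conflicts


if __name__ == "__main__":
    sample = (
        "Auto-merging install.sh\n"
        "CONFLICT (content): Merge conflict in install.sh\n"
        "Auto-merging configs/setups/awicm/awicm-2.0.yaml\n"
    )
    for kind, path in conflicted_files(sample):
        print(kind, path)
```

For the log above, this would leave three files to inspect by hand: install.sh, configs/components/fesom/fesom.yaml, and configs/components/fesom/fesom-2.0.yaml.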

@mandresm
Contributor

And we should also really consider renaming this sad file to a run file. It makes me sad whenever I look at it ;-)

I've just made an issue in Jira for that. I'd say we do this when we have the monorepo up and running.

@pgierz
Member Author

pgierz commented Oct 18, 2021

I will have a look at the various merge conflicts. Christian, I will put a note here once I am done, could you then please have your student give it a try?

@christian-stepanek

Dear @pgierz @denizural @chrisdane. Our bachelor student Jule is under some time pressure to start a historical simulation with AWI-ESM2.1. Do you have an idea how severe the problems with hist_fixed are, so that we can estimate whether it makes sense for her to run historical simulations based on esm-tools, or whether we have to look for an alternative? Please let me know in case the problem is on her side and not on the side of the branch itself - so far I got the impression that the problem is rather in the branch. In case I can be of help, either in fixing the problem or in identifying the glitches, please also let me know.

Thanks a lot!
Christian

@denizural
Contributor

Hi @christian-stepanek, is your student currently having any problems? If you encountered a runtime problem could you please inform us here?

@christian-stepanek

christian-stepanek commented Nov 12, 2021

Dear @denizural, yes, we are currently still having problems getting the hist_fixed branch to work together with Paul's awiesm-2.1-wiso version that includes PICO-FESOM coupling.

Thanks to a lot of help by @pgierz I have managed to get to the point where a historical simulation is successfully submitted. During the startup phase all the necessary forcing files appear to arrive where they should.

Yet, unfortunately, ECHAM6 crashes at startup due to a missing library (libpnetcdf).

cat /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/experiments/run_hist/run_18500101-18501231/scripts/run_hist_compute_33201705.log
...
330: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/experiments/run_hist/run_18500101-18501231/work/./echam6: error while loading shared libraries: libpnetcdf.so: cannot open shared object file: No such file or directory
...

Interestingly, the simulation does not halt at that point, but resubmits itself just to crash again with the same problem. We had problems of that kind before; likely the error code is not properly caught or interpreted.
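
One way to guard against this kind of resubmission loop is to check both the exit code and the log for known fatal messages before submitting the next run. The following Python sketch illustrates the idea only; `should_resubmit`, `find_fatal_lines`, and the pattern list are hypothetical and not how esm_runscripts currently decides.

```python
# Substrings that indicate the executable never started properly.
# This list is an assumption, taken from the loader error quoted above.
FATAL_PATTERNS = (
    "error while loading shared libraries",
    "cannot open shared object file",
)


def find_fatal_lines(log_text):
    """Return the log lines that contain one of the fatal patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern in line for pattern in FATAL_PATTERNS)
    ]


def should_resubmit(exit_code, log_text):
    """Resubmit only if the run exited cleanly and the log has no fatal lines."""
    return exit_code == 0 and not find_fatal_lines(log_text)
```

Scanning the log is a fallback for the case described here, where the exit code alone apparently did not stop the chain of jobs.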

To overcome the problem of the missing library, @pgierz suggested recompiling awiesm-2.1-wiso after switching to the hist_fixed branches of esm-runscripts and esm-tools. Yet, that does not work, as apparently on that branch esm_master is not able to compile awiesm-2.1-wiso due to conflicting model sources (it expects other sources than available in that version of the model).

I have tried to overcome the crash due to the missing NetCDF library by adding the path where the model appears to expect the library to my LD_LIBRARY_PATH variable.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/sw/rhel6-x64/netcdf/parallel_netcdf-1.6.0-impi-intel14/lib
#before:
(base) a270061@mlogin102% ldd model_codes/awiesm-2.1-wiso/echam-6.3.05p2-wiso/bin/echam6 | grep netcdf
	libnetcdff.so.6 => /sw/rhel6-x64/netcdf/netcdf_fortran-4.4.2-parallel-impi-intel14/lib/libnetcdff.so.6 (0x00002b036b182000)
	libnetcdf.so.7 => /sw/rhel6-x64/netcdf/netcdf_c-4.3.2-gcc48/lib/libnetcdf.so.7 (0x00002b036b7ab000)
	libpnetcdf.so => not found
	libpnetcdf.so => /sw/rhel6-x64/netcdf/parallel_netcdf-1.6.0-impi-intel14/lib/libpnetcdf.so (0x00002b0371a09000)
#after:
(base) a270061@mlogin102% ldd model_codes/awiesm-2.1-wiso/echam-6.3.05p2-wiso/bin/echam6 | grep netcdf
	libnetcdff.so.6 => /sw/rhel6-x64/netcdf/netcdf_fortran-4.4.2-parallel-impi-intel14/lib/libnetcdff.so.6 (0x00002b39b9c3b000)
	libnetcdf.so.7 => /sw/rhel6-x64/netcdf/netcdf_c-4.3.2-gcc48/lib/libnetcdf.so.7 (0x00002b39ba264000)
	libpnetcdf.so => /sw/rhel6-x64/netcdf/parallel_netcdf-1.6.0-impi-intel14/lib/libpnetcdf.so (0x00002b39ba5b4000)

Indeed, the model then does not crash anymore and produces NetCDF output - yet, that output is corrupted.

I am at the end of my ideas and would really appreciate some help in getting this solved.

Here is some information on where you find the model code, the versions of the tools components, as well as a list of the most relevant changes that I made to get the hist_fixed branches into the workflow.

Path to simulation:

/mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/experiments/run_hist/

Path to model code:

/mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/model_codes/awiesm-2.1-wiso

esm_tools versions

(base) a270061@mlogin102% esm_versions check
esm_calendar
├─ version: 5.0.0
├─ path: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/.direnv/python-3.9.1/lib/python3.9/site-packages
├─ branch: 
└─ tags: 

esm_database
├─ version: 5.0.0
├─ path: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/.direnv/python-3.9.1/lib/python3.9/site-packages
├─ branch: 
└─ tags: 

esm_environment
├─ version: 5.1.3
├─ path: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/.direnv/python-3.9.1/lib/python3.9/site-packages
├─ branch: 
└─ tags: 

esm_master
├─ version: 5.1.6
├─ path: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/.direnv/python-3.9.1/lib/python3.9/site-packages
├─ branch: 
└─ tags: 

esm_motd
├─ version: 5.0.2
├─ path: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/.direnv/python-3.9.1/lib/python3.9/site-packages
├─ branch: 
└─ tags: 

esm_parser
├─ version: 5.1.12
├─ path: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/.direnv/python-3.9.1/lib/python3.9/site-packages
├─ branch: 
└─ tags: 

esm_plugin_manager
├─ version: 5.0.1
├─ path: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/.direnv/python-3.9.1/lib/python3.9/site-packages
├─ branch: 
└─ tags: 

esm_profile
├─ version: 5.0.0
├─ path: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/.direnv/python-3.9.1/lib/python3.9/site-packages
├─ branch: 
└─ tags: 

esm_rcfile
├─ version: 5.1.0
├─ path: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/.direnv/python-3.9.1/lib/python3.9/site-packages
├─ branch: 
└─ tags: 

esm_runscripts
├─ version: 5.0.17
├─ path: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/software/github.com/esm-tools/esm_runscripts
├─ branch: hist_fixed
└─ tags: v5.0.17-80-g51d71e6

esm_tools
├─ version: 5.1.23
├─ path: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/software/github.com/esm-tools/esm_tools
├─ branch: hist_fixed
└─ tags: v5.1.23-3-ga9517323

esm_version_checker
├─ version: 5.1.5
├─ path: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/.direnv/python-3.9.1/lib/python3.9/site-packages
├─ branch: 
└─ tags: 

Key steps taken to set up model, infrastructure, and simulation:

#get encapsulated model and esm-tools
git lfs clone --depth 1 --recurse-submodules https://gitlab.awi.de/paleodyn/Projects/awiesm-2.2-dev/pico-fesom.git

#install esm-tools
PROJECT_BASE="/mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom"
cd ${PROJECT_BASE}
cd software/github.com/esm-tools/esm_tools
./install.sh

#switch off restoring in sub_picocpl.F90 to not use PICO-FESOM-restoring
cd ${PROJECT_BASE}
joe model_codes/awiesm-2.1-wiso/fesom-2.0/src/sub_picocpl.F90
#edited as follows:
!CS: set restoring to zero
!  real(kind=WP)                    :: gammaT=1.e-4, gammaS=1.e-4
   real(kind=WP)                    :: gammaT=0.0, gammaS=0.0

#compile the model
cd ${PROJECT_BASE}/model_codes
esm_master comp-awiesm-2.1-wiso

#switch to hist_fixed branch for esm_tools
cd ${PROJECT_BASE}/software/github.com/esm-tools/esm_tools/
git checkout hist_fixed
cd ${PROJECT_BASE}

#switch to hist_fixed branch for esm_runscripts
git submodule add https://github.com/esm-tools/esm_runscripts software/github.com/esm-tools/esm_runscripts
cd software/github.com/esm-tools/esm_runscripts/
git checkout hist_fixed
pip install -e .

#submit the simulation
cd ${PROJECT_BASE}/run_configs
esm_runscripts run_hist_working.yaml -e run_hist

@christian-stepanek

christian-stepanek commented Nov 12, 2021

@mandresm I heard from Paul that you are working on getting the wiso flavor of awiesm-2.1 into the pre-release branch, and that you are also testing historical. Could you please give a remark when historical is successfully working on the pre-release branch, and when we could employ that branch for our work? Thanks a lot.

@mandresm
Contributor

Dear @christian-stepanek ,

As far as I know, awiesm-2.1-wiso is not supported in the release branch, nor in hist_fixed. awiesm-2.1-wiso is supported in its own branch, and that one does not include the historical fixes.

Currently, we are working on preparing version 6. The branch for version 6 has both integrated (standalone testing for historical runs is WIP, to be finished today), but I have not tested that they play well together. If you want, we can meet on Monday to give it a try in the future release branch.

@christian-stepanek

christian-stepanek commented Nov 12, 2021

Dear @mandresm,
I have a meeting on Monday at 11 am. We could meet before (10:30???). Thanks a lot.

Actually, in my current state of mind I do not really care anymore whether wiso is included in the model version or not. If you could point me to any awiesm-2.1 and esm-tools capable of running a historical simulation, that would be marvelous. It may be something that is not released yet, as long as it runs for this one particular simulation.

We have a bachelor student here who does not have much time to finish her research. My intention was to let her use an up-to-date model version to generate a historical simulation. Hence I have spent the last days trying to figure out how to get the hist_fixed branch into the awiesm-2.1-wiso version by @pgierz. Yet, maybe that plan was too aspirational.

In the end I am happy if our student has any awiesm-2.1 model and an esm-tools at hand that can successfully run a historical simulation. We would also run a PALEO simulation thereafter, but as far as I know that setup type is generally stable in esm-tools.

@christian-stepanek

@mandresm I think we got very close to a running simulation, as outlined above. The hist_fixed branch was successfully implemented and the forcing files are also distributed to work. The last problem is the echam6 error with missing libpnetcdf.so. I think thereafter the simulation would actually run. In case you have any quick idea how this could be solved that would be marvelous as well.

@mandresm
Contributor

Can you provide a runscript?

@christian-stepanek

Of course.

Here: /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/run_configs/run_hist_working.yaml

The whole simulation runs in /mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/experiments/run_hist/

@mandresm
Contributor

Can you also provide the esm_tools path? As the error message says, the problem is that echam is missing a netcdf library, so it is a problem with the environment set for mistral, and I need to have a look at the mistral.yaml.

@christian-stepanek

Of course, also in that project directory:

a270061@mlogin102% which esm_master
/mnt/lustre02/work/ab0246/a270061/awiesm-2.1-wiso_NetCDF_error/pico-fesom/.direnv/python-3.9.1/bin/esm_master

@mandresm
Contributor

I've compared your echam compilation scripts to a working compilation script in our newest version and there are no differences that would affect the libnetcdf.

However, I see quite some changes in the echam source code. I'd suggest you do a clean installation of awiesm-2.1 first and check whether any libraries are missing with ldd echam6. Then we can start to narrow down the problem.
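
The ldd check suggested here can also be scripted, for example to run it after every fresh compilation. The following is a minimal Python sketch, assuming the `name => target` output format shown earlier in this thread; `missing_libraries` is a hypothetical helper, not part of esm-tools.

```python
def missing_libraries(ldd_output):
    """Return the library names that ldd reports as "not found"."""
    missing = []
    for line in ldd_output.splitlines():
        # Each relevant line looks like "libfoo.so => /path/libfoo.so (0x...)"
        # or "libfoo.so => not found".
        name, separator, target = line.strip().partition(" => ")
        if separator and target.strip() == "not found" and name not in missing:
            missing.append(name)
    return missing


if __name__ == "__main__":
    sample = (
        "\tlibnetcdf.so.7 => /sw/rhel6-x64/netcdf/netcdf_c-4.3.2-gcc48/lib/libnetcdf.so.7 (0x00002b036b7ab000)\n"
        "\tlibpnetcdf.so => not found\n"
    )
    print(missing_libraries(sample))
```

The text passed in could come from `subprocess.run(["ldd", path], capture_output=True, text=True).stdout`, where `path` points to the echam6 binary.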

6 participants