Add fetch job and update stage_ic to work with fetched ICs #3141

Draft · wants to merge 14 commits into base: develop

Conversation

DavidGrumm-NOAA (Collaborator)

Description

Most jobs require the initial conditions to be available on local disk. The existing “stage_ic” task copies (stages) these initial conditions into the experiment's COM directory. This PR adds a “fetch” task that extends that functionality to copy initial conditions from HPSS (on HPSS-accessible machines) into COM.

This PR resolves issue “Stage initial conditions stored on HPSS (#2988)”.

This is currently a DRAFT PR.

Type of change

  • New feature (adds functionality)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? YES
  • Does this change require an update to any of the following submodules? NO

How has this been tested?

  • In the process of being tested on Hera

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary

@DavidGrumm-NOAA (Collaborator, Author)

I am in the process of testing.

@DavidGrumm-NOAA (Collaborator, Author) commented Dec 7, 2024

To test my code, I ran create_experiment.py with the short YAML C48_ATM.yaml (which created /scratch1/NCEPDEV/global/David.Grumm/G_WF_2988/testroot_1/EXPDIR and COMROOT):

HPC_ACCOUNT="fv3-cpu" MY_TESTROOT="/scratch1/NCEPDEV/global/David.Grumm/G_WF_2988/testroot_1" RUNTESTS=${MY_TESTROOT} pslot="1306a_2988" ./create_experiment.py --yaml ../ci/cases/pr/C48_ATM.yaml

… which completed without error or warning messages.

From within that EXPDIR I ran rocotorun:
rocotorun -w ./1306a_2988/1306a_2988.xml -d ./1306a_2988/1306a_2988.db

… which completed without error or warning messages. There was also no output to stdout, which I did not expect, as I had placed a few diagnostic prints in my code. I verified that I am on my current branch.

Running rocotostat gives me:

CYCLE TASK JOBID STATE EXIT STATUS TRIES DURATION

========================================================================
202103231200 gfs_stage_ic druby://10.184.8.62:37937 SUBMITTING - 0 0.0
202103231200 gfs_fcst_seg0 - - - - -
202103231200 gfs_atmos_prod_f000 - - - - -
etc.
… and this has appeared unchanged for at least the 4 hours since I ran rocotorun.

I have 2 questions:

  • Am I not running my code, since the diagnostic prints do not appear? (I was able to run similarly modified versions of exglobal_fetch.py and fetch.py earlier when I ran fetch.sh, but I was running that script from the same directory. It may simply be a matter of manually resetting my PATH.)

  • Rocotostat indicates that a job has been submitted (presumably with another version of the code). Shouldn’t I expect it to progress?

@DavidHuber-NOAA (Contributor)

Rocoto is not a fully automated system. For each invocation of rocotorun, it checks the status of running jobs and updates its database accordingly, checks to see if any new jobs have met their prerequisites, and submits jobs to the queue that are ready to run. rocotostat simply reads the database and reports its contents. Thus, rocotorun must be run every few minutes to continuously submit jobs. Conveniently, in your EXPDIR, there is a file with extension .crontab. Copy the contents of this file to your crontab (which you can edit via the crontab -e command). Now your experiment should run continuously. For more details, read the docs.
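For reference, the generated .crontab boils down to a single periodic rocotorun invocation. A hedged sketch (the interval, paths, and rocotorun location below are illustrative, not the exact generated content; use the file from your EXPDIR):

# illustrative only: run rocotorun every 5 minutes against the experiment's XML and database
*/5 * * * * /path/to/rocoto/bin/rocotorun -d /path/to/EXPDIR/1306a_2988/1306a_2988.db -w /path/to/EXPDIR/1306a_2988/1306a_2988.xml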

@DavidHuber-NOAA (Contributor) left a comment

Just some whitespace cleanup (most of it mine).

Review comments (outdated, resolved) on ush/python/pygfs/task/fetch.py
@DavidGrumm-NOAA (Collaborator, Author)

I updated the crontab.

@DavidGrumm-NOAA (Collaborator, Author) commented Dec 10, 2024

Removed extraneous white space from fetch.py and recommitted; still testing.

@DavidGrumm-NOAA (Collaborator, Author) commented Dec 16, 2024

I moved the fetch options to be in the run_options dict.
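For context, a minimal sketch of what those options might look like once they live in run_options (the flag names match the dependency check further down; how their values are derived here is an assumption, not this PR's code):

# hypothetical sketch: fetch switches carried in the run_options dict
do_fetch_hpss = True    # e.g. ICs live in an HPSS archive and the machine has HPSS access
do_fetch_local = False  # e.g. ICs are already available on local disk
run_options = {'do_fetch_hpss': do_fetch_hpss, 'do_fetch_local': do_fetch_local}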

@DavidHuber-NOAA (Contributor) commented Dec 17, 2024

I think this should be in parm/fetch/ATM_cold.yaml.j2

EDIT: added .j2 as this is Jinja-templated.

DavidGrumm-NOAA (Collaborator, Author):

Per the comment below, removing parm/fetch/ATM_cold.yaml.j2

DavidHuber-NOAA (Contributor):

Sorry for the confusion, @DavidGrumm-NOAA. I was suggesting that this file should be moved to parm/fetch/ATM_cold.yaml.j2. This is the template for the fetch job, not for a CI test.

DavidGrumm-NOAA (Collaborator, Author):

Done

DavidHuber-NOAA (Contributor):

Thanks for making a copy to parm/fetch/ATM_cold.yaml.j2. This file can be deleted now with git rm ci/cases/pr/ATM_cold.yaml && git commit && git push.

Additional review comments (resolved) on: parm/config/gefs/config.base, parm/config/gefs/config.fetch, scripts/exglobal_fetch.py, ush/python/pygfs/task/fetch.py, ci/cases/pr/ATM_cold.yaml
@DavidHuber-NOAA changed the title from “Stage_ic updates: GH2988” to “Add fetch job and update stage_ic to work with fetched ICs” on Dec 23, 2024
fetch = Fetch(config)

# Pull out all the configuration keys needed to run the fetch step
keys = ['current_cycle', 'RUN', 'PDY', 'PARMgfs', 'PSLOT', 'ROTDIR', 'fetch_yaml', 'FETCHDIR', 'ntiles', 'DATAROOT', 'cycle_YMDH']
DavidHuber-NOAA (Contributor):

My bad, cycle_YMDH needs to be set in the jinja2 file.

Suggested change
keys = ['current_cycle', 'RUN', 'PDY', 'PARMgfs', 'PSLOT', 'ROTDIR', 'fetch_yaml', 'FETCHDIR', 'ntiles', 'DATAROOT', 'cycle_YMDH']
keys = ['current_cycle', 'RUN', 'PDY', 'PARMgfs', 'PSLOT', 'ROTDIR', 'fetch_yaml', 'FETCHDIR', 'ntiles', 'DATAROOT']
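A minimal sketch of how cycle_YMDH might then be derived inside the Jinja-templated YAML instead (assuming wxflow's to_YMDH Jinja filter is available when the template is rendered; the tarball layout is taken from the log further down, the rest is illustrative):

{# illustrative only: derive YYYYMMDDHH from current_cycle inside the template #}
{% set cycle_YMDH = current_cycle | to_YMDH %}
untar:
  tarball: "{{ FETCHDIR }}/{{ cycle_YMDH }}/atm_cold.tar"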

@DavidGrumm-NOAA (Collaborator, Author) commented Dec 23, 2024

Those changes fixed that ‘None’ error in the path, so it is now:

/NCEPDEV/emc-global/1year/David.Grumm/test_data/2021032312/atm_cold.tar

… but that directory (/NCEPDEV/emc-global/1year/David.Grumm/test_data/2021032312/) does not seem to exist, so it still fails. From the log:

< snip >
2024-12-23 20:30:21,057 - DEBUG - fetch : returning: {'untar': {'tarball': '/NCEPDEV/emc-global/1year/David.Grumm/test_data/2021032312/atm_cold.tar', 'on_hpss': True, 'contents': ['gfs_ctrl.nc', 'gfs_data.tile1.nc', 'gfs_data.tile2.nc', 'gfs_data.tile3.nc', 'gfs_data.tile4.nc', 'gfs_data.tile5.nc', 'gfs_data.tile6.nc', 'sfc_data.tile1.nc', 'sfc_data.tile2.nc', 'sfc_data.tile3.nc', 'sfc_data.tile4.nc', 'sfc_data.tile5.nc', 'sfc_data.tile6.nc'], 'destination': '/scratch1/NCEPDEV/stmp2/David.Grumm/RUNDIRS/pslot_3141/gfs.2021032312'}}
2024-12-23 20:30:21,057 - INFO - fetch : BEGIN: pygfs.task.fetch.execute_pull_data
2024-12-23 20:30:21,057 - DEBUG - fetch : ( <pygfs.task.fetch.Fetch object at 0x148321730290>, {'untar': {'tarball': '/NCEPDEV/emc-global/1year/David.Grumm/test_data/2021032312/atm_cold.tar', 'on_hpss': True, 'contents': ['gfs_ctrl.nc', 'gfs_data.tile1.nc', 'gfs_data.tile2.nc', 'gfs_data.tile3.nc', 'gfs_data.tile4.nc', 'gfs_data.tile5.nc', 'gfs_data.tile6.nc', 'sfc_data.tile1.nc', 'sfc_data.tile2.nc', 'sfc_data.tile3.nc', 'sfc_data.tile4.nc', 'sfc_data.tile5.nc', 'sfc_data.tile6.nc'], 'destination': '/scratch1/NCEPDEV/stmp2/David.Grumm/RUNDIRS/pslot_3141/gfs.2021032312'}} )
[connecting to hpsscore1.fairmont.rdhpcs.noaa.gov/1217]
ERROR: [FATAL] no such HPSS archive file: /NCEPDEV/emc-global/1year/David.Grumm/test_data/2021032312/atm_cold.tar
###WARNING htar returned non-zero exit status.
72 = /apps/hpss_hera/bin/htar -x -v -f /NCEPDEV/emc-global/1year/David.Grumm/test_data/2021032312/atm_cold.tar gfs_ctrl.nc gfs_data.tile1.nc gfs_data.tile2.nc gfs_data.tile3.nc gfs_data.tile4.nc gfs_data.tile5.nc gfs_data.tile6.nc sfc_data.tile1.nc sfc_data.tile2.nc sfc_data.tile3.nc sfc_data.tile4.nc sfc_data.tile5.nc sfc_data.tile6.nc
HTAR: HTAR FAILED
WARNING: Unable to chdir(/scratch1/NCEPDEV/stmp2/David.Grumm/RUNDIRS/pslot_3141/gfs.2021032312)
Traceback (most recent call last):
File "/scratch1/NCEPDEV/global/David.Grumm/G_WF_2988/scripts/exglobal_fetch.py", line 45, in
main()
File "/scratch1/NCEPDEV/global/David.Grumm/G_WF_2988/ush/python/wxflow/logger.py", line 266, in wrapper
retval = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/scratch1/NCEPDEV/global/David.Grumm/G_WF_2988/scripts/exglobal_fetch.py", line 41, in main
fetch.execute_pull_data(fetchdir_set)
File "/scratch1/NCEPDEV/global/David.Grumm/G_WF_2988/ush/python/wxflow/logger.py", line 266, in wrapper
retval = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/scratch1/NCEPDEV/global/David.Grumm/G_WF_2988/ush/python/pygfs/task/fetch.py", line 89, in execute_pull_data
htar_obj.xvf(tarball, f_names)
File "/scratch1/NCEPDEV/global/David.Grumm/G_WF_2988/ush/python/wxflow/htar.py", line 174, in xvf
output = self.extract(tarball, fileset, opts="-v")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch1/NCEPDEV/global/David.Grumm/G_WF_2988/ush/python/wxflow/htar.py", line 153, in extract
output = self._htar(arg_list)
^^^^^^^^^^^^^^^^^^^^
File "/scratch1/NCEPDEV/global/David.Grumm/G_WF_2988/ush/python/wxflow/htar.py", line 53, in _htar
output = self.exe(*arg_list, output=str.split, error=str.split)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch1/NCEPDEV/global/David.Grumm/G_WF_2988/ush/python/wxflow/executable.py", line 230, in call
raise ProcessError(f"Command exited with status {proc.returncode}:", long_msg)
wxflow.executable.ProcessError: Command exited with status 72:
'/apps/hpss/htar' '-x' '-v' '-f' '/NCEPDEV/emc-global/1year/David.Grumm/test_data/2021032312/atm_cold.tar' 'gfs_ctrl.nc' 'gfs_data.tile1.nc' 'gfs_data.tile2.nc' 'gfs_data.tile3.nc' 'gfs_data.tile4.nc' 'gfs_data.tile5.nc' 'gfs_data.tile6.nc' 'sfc_data.tile1.nc' 'sfc_data.tile2.nc' 'sfc_data.tile3.nc' 'sfc_data.tile4.nc' 'sfc_data.tile5.nc' 'sfc_data.tile6.nc'
HTAR: HTAR FAILED
[connecting to hpsscore1.fairmont.rdhpcs.noaa.gov/1217]
ERROR: [FATAL] no such HPSS archive file: /NCEPDEV/emc-global/1year/David.Grumm/test_data/2021032312/atm_cold.tar
###WARNING htar returned non-zero exit status.
72 = /apps/hpss_hera/bin/htar -x -v -f /NCEPDEV/emc-global/1year/David.Grumm/test_data/2021032312/atm_cold.tar gfs_ctrl.nc gfs_data.tile1.nc gfs_data.tile2.nc gfs_data.tile3.nc gfs_data.tile4.nc gfs_data.tile5.nc gfs_data.tile6.nc sfc_data.tile1.nc sfc_data.tile2.nc sfc_data.tile3.nc sfc_data.tile4.nc sfc_data.tile5.nc sfc_data.tile6.nc

  • JGLOBAL_FETCH[1]: postamble JGLOBAL_FETCH 1734985804 1
  • preamble.sh[70]: set +x
    End JGLOBAL_FETCH at 20:30:22 with error code 1 (time elapsed: 00:00:18)
  • fetch.sh[1]: postamble fetch.sh 1734985800 1
  • preamble.sh[70]: set +x
    End fetch.sh at 20:30:22 with error code 1 (time elapsed: 00:00:22)

Start Epilog on node hfe01 for job 4228191 :: Mon Dec 23 20:30:23 UTC 2024
Job 4228191 finished for user David.Grumm in partition service with exit code 1:0


End Epilogue Mon Dec 23 20:30:23 UTC 2024

I will look into the intended creation of this directory.

@DavidGrumm-NOAA (Collaborator, Author)

The current file creation is correct: the error (“no such HPSS archive file”) is resolved by locating the tarball in the correct directory. There is currently a warning from the chdir call for htar/tar, which I am investigating.
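If the remaining chdir warning is simply because the destination run directory does not exist when htar is invoked, one hedged option is to create it first, e.g. with wxflow's FileHandler (sketch only; whether fetch.py or the J-job should own this is an open question):

# sketch: make sure the extraction destination exists before htar tries to chdir into it
from wxflow import FileHandler

# destination as reported in the log above (hard-coded here only for illustration)
destination = "/scratch1/NCEPDEV/stmp2/David.Grumm/RUNDIRS/pslot_3141/gfs.2021032312"
FileHandler({'mkdir': [destination]}).sync()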

fetch = Fetch(config)

# Pull out all the configuration keys needed to run the fetch step
keys = ['current_cycle', 'RUN', 'PDY', 'PARMgfs', 'PSLOT', 'ROTDIR', 'fetch_yaml', 'FETCHDIR', 'ntiles', 'DATAROOT']
DavidHuber-NOAA (Contributor):

config.fetch sets the name of the fetch job Jinja-YAML to FETCH_YAML_TMPL. This is the environment variable that exglobal_fetch.py should be looking for.

Suggested change
keys = ['current_cycle', 'RUN', 'PDY', 'PARMgfs', 'PSLOT', 'ROTDIR', 'fetch_yaml', 'FETCHDIR', 'ntiles', 'DATAROOT']
keys = ['current_cycle', 'RUN', 'PDY', 'PARMgfs', 'PSLOT', 'ROTDIR', 'FETCH_YAML_TMPL', 'FETCHDIR', 'ntiles', 'DATAROOT']
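For reference, a minimal sketch of how exglobal_fetch.py might assemble that dictionary, patterned on the other exglobal_* scripts (cast_strdict_as_dtypedict and AttrDict are existing wxflow utilities; their use here is an assumption about this script, not its verbatim contents):

# sketch only: build the subset of task_config handed to the Fetch task
import os
from wxflow import AttrDict, cast_strdict_as_dtypedict
from pygfs.task.fetch import Fetch

config = cast_strdict_as_dtypedict(os.environ)  # typed view of the job environment
fetch = Fetch(config)
keys = ['current_cycle', 'RUN', 'PDY', 'PARMgfs', 'PSLOT', 'ROTDIR',
        'FETCH_YAML_TMPL', 'FETCHDIR', 'ntiles', 'DATAROOT']
fetch_dict = AttrDict()
for key in keys:
    fetch_dict[key] = fetch.task_config.get(key)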

DavidGrumm-NOAA (Collaborator, Author):

Done


echo "BEGIN: config.fetch"

export FETCH_YAML_TMPL="${PARMgfs}/fetch/C48_cold.yaml.j2"
DavidHuber-NOAA (Contributor):

The name of the Jinja-YAML is ATM_cold.yaml.j2.

Suggested change
export FETCH_YAML_TMPL="${PARMgfs}/fetch/C48_cold.yaml.j2"
export FETCH_YAML_TMPL="${PARMgfs}/fetch/ATM_cold.yaml.j2"

DavidGrumm-NOAA (Collaborator, Author):

Done

Comment on lines 29 to 35
# Also import all COMOUT* directory and template variables
for key in fetch.task_config.keys():
    if key.startswith("COMOUT_"):
        fetch_dict[key] = fetch.task_config.get(key)
        if fetch_dict[key] is None:
            print(f"Warning: key ({key}) not found in task_config!")

DavidHuber-NOAA (Contributor):

Thinking about it, the fetch task doesn't interact with the COM directories, so I think this can be deleted.

Suggested change
# Also import all COMOUT* directory and template variables
for key in fetch.task_config.keys():
    if key.startswith("COMOUT_"):
        fetch_dict[key] = fetch.task_config.get(key)
        if fetch_dict[key] is None:
            print(f"Warning: key ({key}) not found in task_config!")

DavidGrumm-NOAA (Collaborator, Author):

Done

if self.options['do_fetch_hpss'] or self.options['do_fetch_local']:
    deps = []
    dep_dict = {
        'type': 'task', 'name': f'fetch',
DavidHuber-NOAA (Contributor):

The task name should also have the RUN in it:

Suggested change
'type': 'task', 'name': f'fetch',
'type': 'task', 'name': f'{self.run}_fetch',
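To round out the picture, the surrounding dependency block typically continues along these lines (rocoto.add_dependency and rocoto.create_dependency are the existing workflow helpers; their exact use in this task is a sketch, not the PR's verbatim code):

# sketch: make stage_ic wait on the per-RUN fetch task when fetching is enabled
if self.options['do_fetch_hpss'] or self.options['do_fetch_local']:
    deps = []
    dep_dict = {'type': 'task', 'name': f'{self.run}_fetch'}
    deps.append(rocoto.add_dependency(dep_dict))
    dependencies = rocoto.create_dependency(dep=deps)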

DavidGrumm-NOAA (Collaborator, Author):

Done

@@ -10,7 +10,7 @@


 class Tasks:
-    SERVICE_TASKS = ['arch', 'earc', 'stage_ic', 'cleanup']
+    SERVICE_TASKS = ['arch', 'earc', 'stage_ic', 'fetch', 'cleanup']
     VALID_TASKS = ['aerosol_init', 'stage_ic',
DavidHuber-NOAA (Contributor):

fetch should also be added to the VALID_TASKS list.
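A sketch of the intended end state (other VALID_TASKS entries elided for brevity):

SERVICE_TASKS = ['arch', 'earc', 'stage_ic', 'fetch', 'cleanup']
VALID_TASKS = ['aerosol_init', 'stage_ic', 'fetch',
               # ... remaining existing entries unchanged ...
               ]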

DavidGrumm-NOAA (Collaborator, Author):

Done

"""
self.hsi = Hsi()

fetch_yaml = fetch_dict.fetch_yaml
DavidHuber-NOAA (Contributor):

This should be FETCH_YAML_TMPL:

Suggested change
fetch_yaml = fetch_dict.fetch_yaml
fetch_yaml = fetch_dict.FETCH_YAML_TMPL
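For context, a hedged sketch of how the renamed key might then be consumed in fetch.py, assuming wxflow's parse_j2yaml renders the Jinja-YAML and wxflow's Htar performs the extraction (the function name and structure below are assumptions; only the FETCH_YAML_TMPL rename is what the suggestion actually asks for):

# sketch: render the fetch Jinja-YAML and extract the listed files from HPSS
from wxflow import Htar, parse_j2yaml

def pull_data(fetch_dict):
    """Sketch: render the fetch template and pull the listed files from the tarball."""
    fetch_yaml = fetch_dict.FETCH_YAML_TMPL
    fetch_set = parse_j2yaml(fetch_yaml, fetch_dict)  # yields the 'untar' block seen in the log
    Htar().xvf(fetch_set.untar.tarball, fetch_set.untar.contents)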

DavidGrumm-NOAA (Collaborator, Author):

Done

Labels: feature (New feature or request)