Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Zarr Encoder #38

Open
wants to merge 15 commits into
base: develop
Choose a base branch
from
Open

Add Zarr Encoder #38

wants to merge 15 commits into from

Conversation

rmclaren
Copy link
Collaborator

This pull request adds a new encoder for the ZARR format. Here is an example of what the structure of the ZARR file looks like:

The zarr file has the following basic structure:

root
    dimensions:
        Location
        Channel
     MetaData
        datetime
        latitude
        longitude
        ...etc...
     ObsValue
         brightnessTemperature

The global information is added to the root groups attributes.

The special attribute _ARRAY_DIMENSIONS is added for each dataset which maps the data to the associated dimension arrays.

@rmclaren rmclaren self-assigned this Dec 10, 2024
@rmclaren rmclaren added the enhancement New feature or request label Dec 10, 2024
Copy link
Collaborator

@CoryMartin-NOAA CoryMartin-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of comments on documentation but otherwise looks good.

General design comment / question. Can this be generalized such that one can do:

BUFR to IODA
BUFR to ZARR
ZARR to IODA
ZARR to BUFR
IODA to BUFR
IODA to ZARR
?
Just a thought. I'm not sure if all of those combos would be needed but I imagine there will be a need to write BUFR for some purpose to disseminate to WMO? @emilyhcliu any thoughts?

docs/yaml.rst Outdated Show resolved Hide resolved
@@ -134,13 +133,11 @@ It has the following sub-sections:
Encoder Description
~~~~~~~~~~~~~~~~

The **ioda** section defines the ObsGroup objects that will be created. Here is an example:
The **encoder** section defines the ObsGroup objects that will be created. Here is an example:

.. code-block:: yaml

encoder:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you not specify in the YAML which encoder you want to use?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question.

encoder:
  type: netcdf
  dimensions:
  globals:
  variables:

Currently the type does not have any effect for netcdf or ioda format.
Do we need to specify type to zarr when the output is zarr?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description for the encoder is currently not dependent on the encoder type. This is partially due to the fact that all the output formats are similar to HDF5. Currently no need to indicate the output type in the YAML file (the spec is the same).

There is no ability to output BUFR files of any kind... This would be a HUGE deal, probably leading us to dump the dependency on NCEPLib-bufr.

The IODA Encoder for bufr is part of the IODA project, and is not documented here...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emilyhcliu The type attribute is not used or needed at all.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thanks @rmclaren for the clarification

@@ -134,13 +133,11 @@ It has the following sub-sections:
Encoder Description
~~~~~~~~~~~~~~~~

The **ioda** section defines the ObsGroup objects that will be created. Here is an example:
The **encoder** section defines the ObsGroup objects that will be created. Here is an example:

.. code-block:: yaml

encoder:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question.

encoder:
  type: netcdf
  dimensions:
  globals:
  variables:

Currently the type does not have any effect for netcdf or ioda format.
Do we need to specify type to zarr when the output is zarr?

@@ -16,13 +20,140 @@ namespace py = pybind11;

using bufr::encoders::Description;


template<typename T>
class PyGlobalWriter : public bufr::encoders::GlobalWriter<T>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmclaren Here, are you adding a Python interface to write global attributes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emilyhcliu No need too specify zarr, as the description between the different outputs is the same.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emilyhcliu Yes exactly. The global attributes are objects with different types (specified in the yaml file). You need to supply it with a writer object so it can write its value. This implementation of the writer writes the value into a python dictionary. This machinery allows the C++ compiler to determine the correct types at compile time.

Copy link
Collaborator

@ilianagenkova ilianagenkova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments are mostly to understand the code.

Cloned it but couldn't run -
source /home/Emily.Liu/modules/env_jedi.sh
python bufr2zarr.py /TEMP/bufr-query/tools/bufr
Traceback (most recent call last):
File "/scratch1/NCEPDEV/da/Iliana.Genkova/TEMP/bufr-query/tools/bufr2zarr/bufr2zarr.py", line 8, in
import bufr
ModuleNotFoundError: No module named 'bufr'

Likely my environment is not set right.

docs/yaml.rst Outdated Show resolved Hide resolved
import re
from typing import Union

import zarr
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "zarr" off-the-shelf/standard Python package?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must be installed separatly (pip install zarr). The python package is the official implementation for that format.

@@ -45,5 +45,6 @@ void setupMpi(py::module& m)
py::class_<bufr::mpi::Comm>(m, "Comm")
.def(py::init<const std::string&>())
.def("name", &bufr::mpi::Comm::name)
.def("rank", &bufr::mpi::Comm::rank);
.def("rank", &bufr::mpi::Comm::rank)
.def("size", &bufr::mpi::Comm::size);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was "size" not necessary before?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just never ended up using it, so never noticed it was missing. Should have been there all along..

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know where it could be done, but some documentations of the difference between "rank" and "size" might be helpful.

test/testinput/bufrtest_python_test.py Show resolved Hide resolved
else:
container = bufr.Parser(data_path, mapping_path).parse()

if comm.rank() == 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmclaren No executable for bufr2zarr. The data will be encoded through zarr.Encoder in Python, correct?

dataset = next(iter(zarr.Encoder(YAML_PATH).encode(container, OUTPUT_PATH).values()))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bufr2zarr.py is installed in the bin directory and can be treated just like bufr2netcdf.x. It even takes the same arguments. The zarr encoder is implemented in python as the zarr library has no c/c++ interface, that is true.

@emilyhcliu
Copy link
Collaborator

emilyhcliu commented Dec 10, 2024

couple of comments on documentation but otherwise looks good.

General design comment / question. Can this be generalized such that one can do:

BUFR to IODA BUFR to ZARR ZARR to IODA ZARR to BUFR IODA to BUFR IODA to ZARR ? Just a thought. I'm not sure if all of those combos would be needed but I imagine there will be a need to write BUFR for some purpose to disseminate to WMO? @emilyhcliu any thoughts?
@CoryMartin-NOAA Do you specifically mean the WDQMS to WMO?
If it is WDQMS, we do not need to convert ZARR/IODA/NETCDF to BUFR. WDQMS requires the conversion of any NWP native diagnostic files into a data template (selected output/statistics from NWP diagnostics) in CSV format.

@CoryMartin-NOAA
Copy link
Collaborator

@emilyhcliu I was thinking more like will there be a need to have something 'like prepBUFR' written out, I guess I misspoke with the WMO comment.

@rmclaren
Copy link
Collaborator Author

rmclaren commented Dec 11, 2024

@ilianagenkova It sounds like you need to extend your python path too point at the correct directory. So something like this:

export PYTHONPATH=<my project dir>/bufr-query/build/lib/python3.12/site-packages:$PYTHONPATH

@emilyhcliu
Copy link
Collaborator

@emilyhcliu I was thinking more like will there be a need to have something 'like prepBUFR' written out, I guess I misspoke with the WMO comment.

@CoryMartin-NOAA I hope there is no need for this in the future :-(. I will ask Daryl if writing out data (obs, analysis, ....etc) to BUFR again is required.

@ilianagenkova
Copy link
Collaborator

@emilyhcliu I was thinking more like will there be a need to have something 'like prepBUFR' written out, I guess I misspoke with the WMO comment.

@CoryMartin-NOAA I hope there is no need for this in the future :-(. I will ask Daryl if writing out data (obs, analysis, ....etc) to BUFR again is required.

I believe the BUFR format is related to disseminating data via GTS.
GTS will be gradually replaced by WIS2 (starting Jan 2025), but BUFR may stay around as the WMO preferred format.
Let me read the latest WIS2 docs and will update you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants