Add Zarr Encoder #38

rmclaren · 2024-12-10T21:43:08Z

This pull request adds a new encoder for the Zarr format. The Zarr file has the following basic structure:

root
    dimensions
        Location
        Channel
    MetaData
        datetime
        latitude
        longitude
        ...etc...
    ObsValue
        brightnessTemperature

The global information is added to the root group's attributes.

The special attribute _ARRAY_DIMENSIONS is added to each dataset; it maps the dataset's axes to the associated dimension arrays.
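As a rough illustration of the _ARRAY_DIMENSIONS mapping, here is a pure-Python sketch that makes no zarr calls. The dataset paths follow the structure above, but the dimension assignments (for example, brightnessTemperature being Location by Channel) and the helper names `dataset_attrs` and `resolve_dims` are assumptions made for the example, not taken from the encoder.

```python
import json


def dataset_attrs(dim_names):
    """Build the attribute dict that ties a dataset to its dimension arrays."""
    return {"_ARRAY_DIMENSIONS": list(dim_names)}


# Hypothetical datasets; the dimension assignments are assumptions.
datasets = {
    "MetaData/latitude": dataset_attrs(["Location"]),
    "ObsValue/brightnessTemperature": dataset_attrs(["Location", "Channel"]),
}


def resolve_dims(path):
    """Return the paths of the dimension arrays a dataset refers to."""
    return [f"dimensions/{name}" for name in datasets[path]["_ARRAY_DIMENSIONS"]]


# Attributes are JSON-serializable, so they can be inspected as plain JSON.
print(json.dumps(datasets["ObsValue/brightnessTemperature"]))
print(resolve_dims("ObsValue/brightnessTemperature"))
```

A reader of the file can use the attribute in the same way: look up each name in the list to find the array that labels the corresponding axis.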

CoryMartin-NOAA

couple of comments on documentation but otherwise looks good.

General design comment / question. Can this be generalized such that one can do:

BUFR to IODA
BUFR to ZARR
ZARR to IODA
ZARR to BUFR
IODA to BUFR
IODA to ZARR
?
Just a thought. I'm not sure if all of those combos would be needed but I imagine there will be a need to write BUFR for some purpose to disseminate to WMO? @emilyhcliu any thoughts?

CoryMartin-NOAA · 2024-12-10T21:53:35Z

docs/yaml.rst

@@ -134,13 +133,11 @@ It has the following sub-sections:
 Encoder Description
 ~~~~~~~~~~~~~~~~

-The **ioda** section defines the ObsGroup objects that will be created. Here is an example:
+The **encoder** section defines the ObsGroup objects that will be created. Here is an example:

 .. code-block:: yaml

  encoder:


Do you not specify in the YAML which encoder you want to use?

Same question.

encoder:
  type: netcdf
  dimensions:
  globals:
  variables:

Currently the type does not have any effect for the netcdf or ioda formats.
Do we need to set type to zarr when the output is zarr?

The description for the encoder currently does not depend on the encoder type. This is partly because all of the output formats are similar to HDF5. There is currently no need to indicate the output type in the YAML file (the spec is the same).

There is no ability to output BUFR files of any kind... This would be a HUGE deal, probably leading us to dump the dependency on NCEPLIBS-bufr.

The IODA Encoder for bufr is part of the IODA project, and is not documented here...

@emilyhcliu The type attribute is not used or needed at all.

ok thanks @rmclaren for the clarification

emilyhcliu · 2024-12-10T22:11:01Z

docs/yaml.rst

@@ -134,13 +133,11 @@ It has the following sub-sections:
 Encoder Description
 ~~~~~~~~~~~~~~~~

-The **ioda** section defines the ObsGroup objects that will be created. Here is an example:
+The **encoder** section defines the ObsGroup objects that will be created. Here is an example:

 .. code-block:: yaml

  encoder:


Same question.

encoder:
  type: netcdf
  dimensions:
  globals:
  variables:

Currently the type does not have any effect for the netcdf or ioda formats.
Do we need to set type to zarr when the output is zarr?

emilyhcliu · 2024-12-10T22:20:36Z

python/py_encoder_description.cpp

@@ -16,13 +20,140 @@ namespace py = pybind11;

 using bufr::encoders::Description;

+
+template<typename T>
+class PyGlobalWriter : public bufr::encoders::GlobalWriter<T>


@rmclaren Here, are you adding a Python interface to write global attributes?

@emilyhcliu No need to specify zarr, as the description is the same for all of the output formats.

@emilyhcliu Yes exactly. The global attributes are objects with different types (specified in the yaml file). You need to supply it with a writer object so it can write its value. This implementation of the writer writes the value into a python dictionary. This machinery allows the C++ compiler to determine the correct types at compile time.
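The handoff described above (each global attribute is given a writer and writes its own value into it) can be sketched in Python. This is only an analogue: the class names `DictWriter` and `GlobalAttribute` are hypothetical, and because Python is dynamically typed the sketch does not show the compile-time type resolution that the C++ templates provide.

```python
class DictWriter:
    """Hypothetical writer that stores each value in a plain dictionary."""

    def __init__(self):
        self.values = {}

    def write(self, name, value):
        self.values[name] = value


class GlobalAttribute:
    """Hypothetical global attribute holding a name and a typed value."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def write_to(self, writer):
        # The attribute decides what to write; the writer decides where it goes.
        writer.write(self.name, self.value)


# Globals of different types, as they might be declared in the YAML file.
globals_ = [
    GlobalAttribute("platform", "example"),
    GlobalAttribute("version", 2),
    GlobalAttribute("scale", 0.5),
]

writer = DictWriter()
for attribute in globals_:
    attribute.write_to(writer)

print(writer.values)
```

Swapping in a different writer (one that targets a file instead of a dictionary, say) would not require any change to the attribute objects.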

ilianagenkova

My comments are mostly to understand the code.

I cloned it but couldn't run it:

source /home/Emily.Liu/modules/env_jedi.sh
python bufr2zarr.py /TEMP/bufr-query/tools/bufr
Traceback (most recent call last):
  File "/scratch1/NCEPDEV/da/Iliana.Genkova/TEMP/bufr-query/tools/bufr2zarr/bufr2zarr.py", line 8, in <module>
    import bufr
ModuleNotFoundError: No module named 'bufr'

Likely my environment is not set right.

docs/yaml.rst

ilianagenkova · 2024-12-10T22:18:26Z

python/bufr/encoders/zarr/encoder.py

+import re
+from typing import Union
+
+import zarr


Is "zarr" an off-the-shelf/standard Python package?

It must be installed separately (pip install zarr). The Python package is the official implementation of the format.

ilianagenkova · 2024-12-10T22:23:48Z

python/py_mpi.cpp

@@ -45,5 +45,6 @@ void setupMpi(py::module& m)
  py::class_<bufr::mpi::Comm>(m, "Comm")
    .def(py::init<const std::string&>())
    .def("name", &bufr::mpi::Comm::name)
-    .def("rank", &bufr::mpi::Comm::rank);
+    .def("rank", &bufr::mpi::Comm::rank)
+    .def("size", &bufr::mpi::Comm::size);


How was "size" not necessary before?

I just never ended up using it, so I never noticed it was missing. It should have been there all along.

I don't know where it could be done, but some documentation of the difference between "rank" and "size" might be helpful.
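On the rank versus size question: in MPI terms, size is the number of processes in the communicator and rank is the index of the current process, from 0 to size - 1. The sketch below shows one common way the two are used together. It does not call MPI or the `bufr` module, and the round-robin split and the function name `items_for_rank` are assumptions for illustration, not how bufr-query divides its work.

```python
def items_for_rank(items, rank, size):
    """Return the items that the process with the given rank would handle."""
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    # Round-robin: rank r takes items r, r + size, r + 2 * size, ...
    return items[rank::size]


files = ["a.bufr", "b.bufr", "c.bufr", "d.bufr", "e.bufr"]
size = 3  # total number of processes

# Simulate what each rank would see if it ran this code on its own.
parts = [items_for_rank(files, rank, size) for rank in range(size)]
for rank, part in enumerate(parts):
    print(f"rank {rank} of {size}: {part}")
```

Every item is handled by exactly one rank, which is why both values are needed: rank alone does not tell a process how many peers share the work.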

test/testinput/bufrtest_python_test.py

emilyhcliu · 2024-12-10T22:42:10Z

tools/bufr2zarr/bufr2zarr.py

+    else:
+        container = bufr.Parser(data_path, mapping_path).parse()
+
+    if comm.rank() == 0:


@rmclaren No executable for bufr2zarr. The data will be encoded through zarr.Encoder in Python, correct?

dataset = next(iter(zarr.Encoder(YAML_PATH).encode(container, OUTPUT_PATH).values()))

Correct, the zarr encoder is implemented in Python because the zarr library has no C/C++ interface. bufr2zarr.py is installed in the bin directory and can be used just like bufr2netcdf.x. It even takes the same arguments.
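The one-liner quoted above relies on the encoder returning a dict of encoded objects and then takes the first value. A minimal sketch of that idiom with a stand-in dictionary follows. The keys and values here are made up, since the thread does not say what the encoder uses as keys, and `first_encoded` is a hypothetical helper name.

```python
def first_encoded(encoded):
    """Return the first value of a dict of encoded objects."""
    if not encoded:
        raise ValueError("encoder returned no objects")
    # Dicts preserve insertion order, so this is the first object added.
    return next(iter(encoded.values()))


# Stand-in for the dict the encoder returns; keys and values are assumptions.
encoded = {"first_output": "dataset-1", "second_output": "dataset-2"}

dataset = first_encoded(encoded)
print(dataset)
```

Checking for an empty dict first avoids the bare StopIteration that `next(iter(...))` would otherwise raise.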

emilyhcliu · 2024-12-10T22:47:43Z

couple of comments on documentation but otherwise looks good.

General design comment / question. Can this be generalized such that one can do:

BUFR to IODA
BUFR to ZARR
ZARR to IODA
ZARR to BUFR
IODA to BUFR
IODA to ZARR
?
Just a thought. I'm not sure if all of those combos would be needed but I imagine there will be a need to write BUFR for some purpose to disseminate to WMO? @emilyhcliu any thoughts?

@CoryMartin-NOAA Do you specifically mean the WDQMS to WMO?
If it is WDQMS, we do not need to convert ZARR/IODA/NETCDF to BUFR. WDQMS requires the conversion of any NWP native diagnostic files into a data template (selected output/statistics from NWP diagnostics) in CSV format.

CoryMartin-NOAA · 2024-12-11T14:06:05Z

@emilyhcliu I was thinking more like will there be a need to have something 'like prepBUFR' written out, I guess I misspoke with the WMO comment.

rmclaren · 2024-12-11T15:11:43Z

@ilianagenkova It sounds like you need to extend your Python path to point at the correct directory. So something like this:

export PYTHONPATH=<my project dir>/bufr-query/build/lib/python3.12/site-packages:$PYTHONPATH
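After changing PYTHONPATH, a quick way to confirm the fix is to ask Python whether it can locate the module without importing it. This is a small standard-library sketch; `module_available` is a hypothetical helper name, and `bufr` is the module from the traceback above.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(name) is not None


if module_available("bufr"):
    print("bufr is importable")
else:
    # Printing sys.path shows which directories Python actually searched.
    print("bufr not found; searched:")
    for entry in sys.path:
        print("  ", entry)
```

If the build's site-packages directory is missing from the printed list, the export did not take effect in the shell that launched Python.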

emilyhcliu · 2024-12-11T18:35:55Z

@emilyhcliu I was thinking more like will there be a need to have something 'like prepBUFR' written out, I guess I misspoke with the WMO comment.

@CoryMartin-NOAA I hope there is no need for this in the future :-(. I will ask Daryl if writing out data (obs, analysis, ....etc) to BUFR again is required.

ilianagenkova · 2024-12-11T19:43:22Z

@emilyhcliu I was thinking more like will there be a need to have something 'like prepBUFR' written out, I guess I misspoke with the WMO comment.

@CoryMartin-NOAA I hope there is no need for this in the future :-(. I will ask Daryl if writing out data (obs, analysis, ....etc) to BUFR again is required.

I believe the BUFR format is related to disseminating data via GTS.
GTS will be gradually replaced by WIS2 (starting Jan 2025), but BUFR may stay around as the WMO preferred format.
Let me read the latest WIS2 docs and will update you.

rmclaren added 10 commits December 9, 2024 01:25

added basic zarr encoder

632b4d1

fixed global string fields

2ea274c

working on dimensions. Gloobals now attributes

5762a8a

Added dimensions to the zarr file

6cb67b5

update to zarr encooder dimensions.. now putting in dimensions group

4364cad

now picking the data associated with the last dim of the source

395c96f

Added bufr2zarr script.

b2b1ad6

added ability to set chunk size, compression level, range

fe1ef01

fixed bad yaml example in docs

45405dd

zarr encode now returns a dict of encoded objects just like the netcdf encoder.

babb585

rmclaren requested review from CoryMartin-NOAA, emilyhcliu and ilianagenkova December 10, 2024 21:43

rmclaren self-assigned this Dec 10, 2024

rmclaren added the enhancement New feature or request label Dec 10, 2024

rmclaren requested review from givelberg and azadeh-gh December 10, 2024 21:46

rmclaren mentioned this pull request Dec 10, 2024

Documentation Fixes (ReadTheDocs) #35

Open

CoryMartin-NOAA reviewed Dec 10, 2024

emilyhcliu reviewed Dec 10, 2024

ilianagenkova approved these changes Dec 10, 2024

emilyhcliu reviewed Dec 10, 2024

rmclaren added 2 commits December 10, 2024 18:50

yaml.rst doc updates

6d9ea8c

added ilianas suggestion foor wording in yaml.rst

3566fac

rmclaren added 2 commits December 11, 2024 11:42

small improvments

217b8b9

forgot to add coordinates as an attribute to a dataset... fixed

af3f409

fixed prototype path handling code in zarr encoder

3586ec7
