JSON/TOML backend: introduce abbreviated IO modes (openPMD#1493)

* Introduce dataset template mode to JSON backend
* Write used mode to JSON file
* Use Attribute::getOptional for snapshot attribute
* Introduce attribute mode
* Add example 14_toml_template.cpp
* Use Datatype::UNDEFINED to indicate no dataset definition in template
* Extend example
* Test short attribute mode
* Copy datatypeToString to JSON implementation
* Fix after rebase: Init JSON config in parallel mode
* Fix after rebase: Don't erase JSON datasets when writing
* openpmd-pipe: use short modes for test
* Less intrusive warnings, allow disabling them
* TOML: Use short modes by default
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* Documentation
* Short mode in default in openPMD >= 2.
* Short value by default in TOML
* Store the openPMD version information in the IOHandler
* Fixes
* Adapt test to recent rebase

  Reading the chunk table requires NOT using template mode, otherwise the string just consists of '\0' bytes.
* toml11 4.0 compatibility
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* wip: cleanup
* wip: cleanup
* Cleanup
* Extensive testing

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
franzpoeschel · Dec 16, 2024 · 2e246cc
1 parent c639257

Showing 18 changed files with 1,560 additions and 160 deletions.
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -703,6 +703,7 @@ set(openPMD_EXAMPLE_NAMES
     10_streaming_read
     12_span_write
     13_write_dynamic_configuration
+    14_toml_template
 )
 set(openPMD_PYTHON_EXAMPLE_NAMES
     2_read_serial
@@ -1327,6 +1328,9 @@ if(openPMD_BUILD_TESTING)
                             ${openPMD_RUNTIME_OUTPUT_DIRECTORY}/openpmd-pipe       \
                             --infile ../samples/git-sample/thetaMode/data_%T.bp    \
                             --outfile ../samples/git-sample/thetaMode/data%T.json  \
+                            --outconfig '                                          \
+                                json.attribute.mode = \"short\"                  \n\
+                                json.dataset.mode = \"template_no_warn\"'          \
                         "
                     WORKING_DIRECTORY ${openPMD_RUNTIME_OUTPUT_DIRECTORY}
                 )

diff --git a/docs/source/backends/json.rst b/docs/source/backends/json.rst
@@ -38,20 +38,47 @@ when working with the JSON backend.
 Datasets and groups have the same namespace, meaning that there may not be a subgroup
 and a dataset with the same name contained in one group.
 
-Any **openPMD dataset** is a JSON object with three keys:
+Datasets
+........
 
- * ``attributes``: Attributes associated with the dataset. May be ``null`` or not present if no attributes are associated with the dataset.
- * ``datatype``: A string describing the type of the stored data.
- * ``data`` A nested array storing the actual data in row-major manner.
+Datasets can be stored in two modes, either as actual datasets or as dataset templates.
+The mode is selected by the :ref:`JSON/TOML parameter<backendconfig>` ``json.dataset.mode`` (resp. ``toml.dataset.mode``) with possible values ``["dataset", "template"]`` (default: ``"dataset"``).
+
+Stored as an actual dataset, an **openPMD dataset** is a JSON object with three JSON keys:
+
+ * ``datatype`` (required): A string describing the type of the stored data.
+ * ``data`` (required): A nested array storing the actual data in row-major manner.
    The data needs to be consistent with the fields ``datatype`` and ``extent``.
    Checking whether this key points to an array can be (and is internally) used to distinguish groups from datasets.
+ * ``attributes``: Attributes associated with the dataset. May be ``null`` or not present if no attributes are associated with the dataset.
+
+Stored as a **dataset template**, an openPMD dataset is represented by three JSON keys:
+
+ * ``datatype`` (required): As above.
+ * ``extent`` (required): A list of integers, describing the extent of the dataset.
+   This replaces the ``data`` key from the non-template representation.
+ * ``attributes``: As above.
 
-**Attributes** are stored as a JSON object with a key for each attribute.
+This mode stores only the dataset metadata.
+Chunk load/store operations are ignored.
+
+Attributes
+..........
+
+In order to avoid name clashes, attributes are generally stored within a separate subgroup ``attributes``.
+
+Attributes can be stored in two formats.
+The format is selected by the :ref:`JSON/TOML parameter<backendconfig>` ``json.attribute.mode`` (resp. ``toml.attribute.mode``) with possible values ``["long", "short"]`` (default: ``"long"`` for JSON in openPMD 1.*, ``"short"`` otherwise, i.e. generally in openPMD 2.*, but always in TOML).
+
+Attributes in **long format** store the datatype explicitly, by representing attributes as JSON objects.
 Every such attribute is itself a JSON object with two keys:
 
  * ``datatype``: A string describing the type of the value.
  * ``value``: The actual value of type ``datatype``.
 
+Attributes in **short format** are stored as just the simple value corresponding with the attribute.
+Since JSON/TOML values are pretty-printed into a human-readable format, byte-level type details can be lost when reading those values again later on (e.g. the distinction between different integer types).
+
 TOML File Format
 ----------------
 

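The on-disk shapes described in the json.rst changes above can be sketched with plain C++ string builders. This is only an illustration: the helper names (`renderLongAttribute`, `renderShortAttribute`, `renderDatasetTemplate`) and the datatype spellings such as `"UINT"` are assumptions made for this sketch, not openPMD-api functions or guaranteed type names. Only the key names `datatype`, `value` and `extent` are taken from the documentation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// All helpers below are hypothetical illustrations, not openPMD-api functions.

// Long format: the datatype is stored next to the value.
std::string renderLongAttribute(
    std::string const &datatype, std::string const &jsonValue)
{
    return "{\"datatype\": \"" + datatype + "\", \"value\": " + jsonValue +
        "}";
}

// Short format: only the value is stored, so the datatype is dropped.
std::string renderShortAttribute(
    std::string const & /* datatype */, std::string const &jsonValue)
{
    return jsonValue;
}

// Dataset template: "extent" takes the place of the "data" array.
std::string renderDatasetTemplate(
    std::string const &datatype, std::vector<std::uint64_t> const &extent)
{
    std::string res = "{\"datatype\": \"" + datatype + "\", \"extent\": [";
    for (std::size_t i = 0; i < extent.size(); ++i)
    {
        if (i > 0)
        {
            res += ", ";
        }
        res += std::to_string(extent[i]);
    }
    return res + "]}";
}
```

For instance, `renderLongAttribute("UINT", "42")` yields `{"datatype": "UINT", "value": 42}`, whereas the short form of the same attribute is just `42`. Two attributes of different integer types therefore look identical in short format, which is the loss of type detail the documentation warns about.
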
diff --git a/docs/source/details/backendconfig.rst b/docs/source/details/backendconfig.rst
@@ -104,6 +104,8 @@ The key ``rank_table`` allows specifying the creation of a **rank table**, used
 Configuration Structure per Backend
 -----------------------------------
 
+Please refer to the respective backends' documentations for further information on their configuration.
+
 .. _backendconfig-adios2:
 
 ADIOS2
@@ -231,8 +233,21 @@ The parameters eligible for being passed to flush calls may be configured global
 
 .. _backendconfig-other:
 
-Other backends
-^^^^^^^^^^^^^^
+JSON/TOML
+^^^^^^^^^
 
-Do currently not read the configuration string.
-Please refer to the respective backends' documentations for further information on their configuration.
+A full configuration of the JSON backend:
+
+.. literalinclude:: json.json
+   :language: json
+
+The TOML backend is configured analogously, replacing the ``"json"`` key with ``"toml"``.
+
+All keys found under ``json.dataset`` are applicable globally as well as per dataset.
+Explanation of the single keys:
+
+* ``json.dataset.mode`` / ``toml.dataset.mode``: One of ``"dataset"`` (default) or ``"template"``.
+  In "dataset" mode, the dataset will be written as an n-dimensional (recursive) array, padded with nulls (JSON) or zeroes (TOML) for missing values.
+  In "template" mode, only the dataset metadata (type, extent and attributes) is stored and no chunks can be written or read (i.e. write/read operations will be skipped).
+* ``json.attribute.mode`` / ``toml.attribute.mode``: One of ``"long"`` (default in openPMD 1.*) or ``"short"`` (default in openPMD 2.* and generally in TOML).
+  The long format explicitly encodes the attribute type in the dataset on disk; the short format writes only the actual attribute as a JSON/TOML value, requiring readers to recover the type.
diff --git a/docs/source/details/json.json b/docs/source/details/json.json
@@ -0,0 +1,10 @@
+{
+  "json": {
+    "dataset": {
+      "mode": "template"
+    },
+    "attribute": {
+      "mode": "short"
+    }
+  }
+}
diff --git a/examples/14_toml_template.cpp b/examples/14_toml_template.cpp
@@ -0,0 +1,111 @@
+#include <openPMD/openPMD.hpp>
+
+std::string backendEnding()
+{
+    auto extensions = openPMD::getFileExtensions();
+    if (auto it = std::find(extensions.begin(), extensions.end(), "toml");
+        it != extensions.end())
+    {
+        return *it;
+    }
+    else
+    {
+        // Fallback for buggy old NVidia compiler
+        return "json";
+    }
+}
+
+void write()
+{
+    std::string config = R"(
+{
+  "iteration_encoding": "variable_based",
+  "json": {
+    "dataset": {"mode": "template"},
+    "attribute": {"mode": "short"}
+  },
+  "toml": {
+    "dataset": {"mode": "template"},
+    "attribute": {"mode": "short"}
+  }
+}
+)";
+
+    openPMD::Series writeTemplate(
+        "../samples/tomlTemplate." + backendEnding(),
+        openPMD::Access::CREATE,
+        config);
+    auto iteration = writeTemplate.writeIterations()[0];
+
+    openPMD::Dataset ds{openPMD::Datatype::FLOAT, {5, 5}};
+
+    auto temperature =
+        iteration.meshes["temperature"][openPMD::RecordComponent::SCALAR];
+    temperature.resetDataset(ds);
+
+    auto E = iteration.meshes["E"];
+    E["x"].resetDataset(ds);
+    E["y"].resetDataset(ds);
+    /*
+     * Don't specify datatype and extent for this one to indicate that this
+     * information is not yet known.
+     */
+    E["z"].resetDataset({});
+
+    ds.extent = {10};
+
+    auto electrons = iteration.particles["e"];
+    electrons["position"]["x"].resetDataset(ds);
+    electrons["position"]["y"].resetDataset(ds);
+    electrons["position"]["z"].resetDataset(ds);
+
+    electrons["positionOffset"]["x"].resetDataset(ds);
+    electrons["positionOffset"]["y"].resetDataset(ds);
+    electrons["positionOffset"]["z"].resetDataset(ds);
+    electrons["positionOffset"]["x"].makeConstant(3.14);
+    electrons["positionOffset"]["y"].makeConstant(3.14);
+    electrons["positionOffset"]["z"].makeConstant(3.14);
+
+    ds.dtype = openPMD::determineDatatype<uint64_t>();
+    electrons.particlePatches["numParticles"][openPMD::RecordComponent::SCALAR]
+        .resetDataset(ds);
+    electrons
+        .particlePatches["numParticlesOffset"][openPMD::RecordComponent::SCALAR]
+        .resetDataset(ds);
+    electrons.particlePatches["offset"]["x"].resetDataset(ds);
+    electrons.particlePatches["offset"]["y"].resetDataset(ds);
+    electrons.particlePatches["offset"]["z"].resetDataset(ds);
+    electrons.particlePatches["extent"]["x"].resetDataset(ds);
+    electrons.particlePatches["extent"]["y"].resetDataset(ds);
+    electrons.particlePatches["extent"]["z"].resetDataset(ds);
+}
+
+void read()
+{
+    /*
+     * The config is entirely optional, these things are also detected
+     * automatically when reading
+     */
+
+    // std::string config = R"(
+    // {
+    //   "iteration_encoding": "variable_based",
+    //   "toml": {
+    //     "dataset": {"mode": "template"},
+    //     "attribute": {"mode": "short"}
+    //   }
+    // }
+    // )";
+
+    openPMD::Series read(
+        "../samples/tomlTemplate." + backendEnding(),
+        openPMD::Access::READ_LINEAR);
+    read.parseBase();
+    openPMD::helper::listSeries(read);
+}
+
+int main()
+{
+    write();
+    read();
+}
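
The extension lookup in `backendEnding()` above can be isolated into a standalone sketch. `pickExtension` is a hypothetical helper introduced here for illustration, not part of openPMD-api, and the list of available extensions is passed in explicitly instead of being queried from `openPMD::getFileExtensions()`. Note that `std::find` is declared in `<algorithm>`, which the example presumably receives transitively through `openPMD/openPMD.hpp`; the sketch includes it directly.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical generalization of backendEnding(): return `preferred` if it
// is among the available extensions, otherwise return `fallback`.
std::string pickExtension(
    std::vector<std::string> const &available,
    std::string const &preferred,
    std::string const &fallback)
{
    auto it = std::find(available.begin(), available.end(), preferred);
    return it != available.end() ? *it : fallback;
}
```

With this helper, the behavior of the example corresponds to `pickExtension(extensions, "toml", "json")`: TOML is used when the build supports it, and JSON otherwise.
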
diff --git a/include/openPMD/Dataset.hpp b/include/openPMD/Dataset.hpp
@@ -41,18 +41,40 @@ class Dataset
 public:
     enum : std::uint64_t
     {
-        JOINED_DIMENSION = std::numeric_limits<std::uint64_t>::max()
+        /**
+         * Setting one dimension of the extent as JOINED_DIMENSION means that
+         * the extent along that dimension will be defined by the sum of all
+         * parallel processes' contributions.
+         * Only one dimension can be joined. For store operations, the offset
+         * should be an empty array and the extent should give the actual
+         * extent of the chunk (i.e. the number of joined elements along the
+         * joined dimension, equal to the global extent in all other
+         * dimensions). For more details, refer to
+         * docs/source/usage/workflow.rst.
+         */
+        JOINED_DIMENSION = std::numeric_limits<std::uint64_t>::max(),
+        /**
+         * Some backends (i.e. JSON and TOML in template mode) support the
+         * creation of datasets with undefined datatype and extent.
+         * The extent should be given as {UNDEFINED_EXTENT} for that.
+         */
+        UNDEFINED_EXTENT = std::numeric_limits<std::uint64_t>::max() - 1
     };
 
     Dataset(Datatype, Extent, std::string options = "{}");
 
     /**
      * @brief Constructor that sets the datatype to undefined.
      *
-     * Helpful for resizing datasets, since datatypes need not be given twice.
+     * Helpful for:
+     *
+     * 1. Resizing datasets, since datatypes need not be given twice.
+     * 2. Initializing datasets as undefined, as used by template mode in the
+     *    JSON/TOML backend. In this case, the default (undefined) specification
+     *    for the Extent may be used.
      *
      */
-    Dataset(Extent);
+    Dataset(Extent = {UNDEFINED_EXTENT});
 
     Dataset &extend(Extent newExtent);
 

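The two sentinel values added to `Dataset` can be illustrated with a standalone sketch. It copies the constants from the diff into free-standing `constexpr` values and redeclares `Extent` locally. `isUndefinedExtent` and `resolveJoined` are hypothetical helpers, not part of openPMD-api, and the summation over a plain vector merely stands in for the contributions of parallel processes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <numeric>
#include <vector>

// Local stand-ins mirroring the diff; this is not the openPMD header.
using Extent = std::vector<std::uint64_t>;

constexpr std::uint64_t JOINED_DIMENSION =
    std::numeric_limits<std::uint64_t>::max();
constexpr std::uint64_t UNDEFINED_EXTENT =
    std::numeric_limits<std::uint64_t>::max() - 1;

static_assert(
    JOINED_DIMENSION != UNDEFINED_EXTENT, "sentinels must be distinct");

// Hypothetical helper: true for the single-element placeholder extent that
// the new default argument `Extent = {UNDEFINED_EXTENT}` produces.
bool isUndefinedExtent(Extent const &extent)
{
    return extent.size() == 1 && extent[0] == UNDEFINED_EXTENT;
}

// Hypothetical helper: replace the joined dimension by the sum of all
// per-process contributions along it. At most one dimension may be joined.
Extent resolveJoined(
    Extent global, std::vector<std::uint64_t> const &contributions)
{
    assert(
        std::count(global.begin(), global.end(), JOINED_DIMENSION) <= 1);
    for (auto &dim : global)
    {
        if (dim == JOINED_DIMENSION)
        {
            dim = std::accumulate(
                contributions.begin(), contributions.end(), std::uint64_t{0});
        }
    }
    return global;
}
```

Both sentinels sit at the very top of the `std::uint64_t` range, so they cannot collide with a realistic extent. A check such as `isUndefinedExtent` is one way a backend in template mode could tell "no dataset definition yet" apart from a genuine one-dimensional dataset.
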
diff --git a/include/openPMD/IO/AbstractIOHandler.hpp b/include/openPMD/IO/AbstractIOHandler.hpp
@@ -201,6 +201,7 @@ class AbstractIOHandler
 {
     friend class Series;
     friend class ADIOS2IOHandlerImpl;
+    friend class JSONIOHandlerImpl;
     friend class detail::ADIOS2File;
 
 private:

diff --git a/include/openPMD/IO/JSON/JSONIOHandler.hpp b/include/openPMD/IO/JSON/JSONIOHandler.hpp
@@ -23,6 +23,7 @@
 
 #include "openPMD/IO/AbstractIOHandler.hpp"
 #include "openPMD/IO/JSON/JSONIOHandlerImpl.hpp"
+#include "openPMD/auxiliary/JSON_internal.hpp"
 
 #if openPMD_HAVE_MPI
 #include <mpi.h>