Remove warning when passing indexes=None (#357)
* remove warning

* fix test by loading dimension coordinates

* fix other test by passing loadable_variables

* move loadable variables docs section before combining

* remove recommendation to not create indexes

* de-emphasise avoiding creating indexes

* document using xr.combine_by_coords

* clarify todo

* signpost segue

* remove extra line

* release notes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* correct some PR numbers in release notes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refer to #18 in release notes

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
TomNicholas and pre-commit-ci[bot] authored Dec 18, 2024
1 parent 1474f18 commit 3188ca0
Showing 6 changed files with 119 additions and 98 deletions.
1 change: 0 additions & 1 deletion docs/api.rst
@@ -20,7 +20,6 @@ Reading

open_virtual_dataset


Serialization
-------------

8 changes: 6 additions & 2 deletions docs/releases.rst
@@ -14,7 +14,11 @@ Breaking changes

- Passing ``group=None`` (the default) to ``open_virtual_dataset`` for a file with multiple groups no longer raises an error, instead it gives you the root group.
This new behaviour is more consistent with ``xarray.open_dataset``.
(:issue:`336`, :pull:`337`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
(:issue:`336`, :pull:`338`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Indexes are now created by default for any loadable one-dimensional coordinate variables.
Also, a warning is no longer raised when ``indexes=None`` is passed to ``open_virtual_dataset``, and the recommendations in the docs have been updated to match.
This also means that ``xarray.combine_by_coords`` will now work when the necessary dimension coordinates are specified in ``loadable_variables``.
(:issue:`18`, :pull:`357`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

Deprecations
~~~~~~~~~~~~
@@ -23,7 +27,7 @@ Bug fixes
~~~~~~~~~

- Fix bug preventing generating references for the root group of a file when a subgroup exists.
(:issue:`336`, :pull:`337`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
(:issue:`336`, :pull:`338`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

Documentation
~~~~~~~~~~~~~
159 changes: 85 additions & 74 deletions docs/usage.md
@@ -184,6 +184,68 @@ The full Zarr model (for a single group) includes multiple arrays, array names,

The problem of combining many archival format files (e.g. netCDF files) into one virtual Zarr store therefore becomes just a matter of opening each file using `open_virtual_dataset` and using [xarray's various combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html) to combine them into one aggregate virtual dataset.

But before we combine our data, we might want to consider loading some variables into memory.

## Loading variables

Whilst the values of virtual variables (i.e. those backed by `ManifestArray` objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy/dask arrays, just like `xr.open_dataset` normally returns. These variables are specified using the `loadable_variables` argument:

```python
vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'], indexes={})
```
```python
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```
You can see that the dataset contains a mixture of virtual variables backed by `ManifestArray` objects (`lat` and `lon`), and loadable variables backed by (lazy) numpy arrays (`air` and `time`).
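
If you want to check this programmatically, one option (a minimal sketch, reusing the `vds` opened above) is to test whether each variable's underlying data is a `ManifestArray`:

```python
from virtualizarr.manifests import ManifestArray

# variables still represented purely by chunk references (virtual)
virtual_vars = [
    name for name, var in vds.variables.items()
    if isinstance(var.data, ManifestArray)
]
# variables whose values were loaded via `loadable_variables`
loaded_vars = [
    name for name, var in vds.variables.items()
    if not isinstance(var.data, ManifestArray)
]

print(virtual_vars)  # e.g. ['lat', 'lon']
print(loaded_vars)   # e.g. ['air', 'time']
```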

Loading variables can be useful in a few scenarios:
1. You need to look at the actual values of a multi-dimensional variable in order to decide what to do next,
2. You want in-memory indexes to use with ``xr.combine_by_coords``,
3. Storing a variable on-disk as a set of references would be inefficient, e.g. because it's a very small array (saving the values like this is similar to kerchunk's concept of "inlining" data),
4. The variable has encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding.

### CF-encoded time variables

To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to True (`decode_times` defaults to None).

```python
vds = open_virtual_dataset(
'air.nc',
loadable_variables=['air', 'time'],
decode_times=True,
indexes={},
)
```
```python
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
time (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```

## Combining virtual datasets

In general we should be able to combine all the datasets from our archival files into one using some combination of calls to `xarray.concat` and `xarray.merge`. For combining along multiple dimensions in one call we also have `xarray.combine_nested` and `xarray.combine_by_coords`. If you're not familiar with any of these functions we recommend you skim through [xarray's docs on combining](https://docs.xarray.dev/en/stable/user-guide/combining.html).
@@ -206,23 +268,9 @@ TODO: Note about variable-length chunking?

The simplest case of concatenation is when you have a set of files and you know which order they should be concatenated in, _without looking inside the files_. In this case it is sufficient to open the files one-by-one, then pass the virtual datasets as a list to the concatenation function.

We can actually avoid creating any xarray indexes, as we won't need them. Without indexes we can avoid loading any data whatsoever from the files, making our opening and combining much faster than it normally would be. **Therefore if you can do your combining manually you should.** However, you should first be confident that the archival files actually do have compatible data, as only the array shapes and dimension names will be checked for consistency.

You can specify that you don't want any indexes to be created by passing `indexes={}` to `open_virtual_dataset`.

```python
vds1 = open_virtual_dataset('air1.nc', indexes={})
vds2 = open_virtual_dataset('air2.nc', indexes={})
```

We can see that the datasets have no indexes.

```python
vds1.indexes
```
```
Indexes:
*empty*
vds1 = open_virtual_dataset('air1.nc')
vds2 = open_virtual_dataset('air2.nc')
```

As we know the correct order a priori, we can just combine along one dimension using `xarray.concat`.
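
A minimal sketch of what such a call might look like (the specific keyword arguments below are illustrative assumptions, not the only valid choice):

```python
import xarray as xr

# concatenate along the dimension we already know is correct;
# the coords/compat/combine_attrs choices shown here are illustrative
combined_vds = xr.concat(
    [vds1, vds2],
    dim='time',
    coords='minimal',
    compat='override',
    combine_attrs='override',
)
```
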
@@ -285,73 +333,37 @@ In future we would like for it to be possible to just use `xr.open_mfdataset` to open the files and combine them in one go, e.g.
but this requires some [upstream changes](https://github.com/TomNicholas/VirtualiZarr/issues/35) in xarray.
```

### Automatic ordering using coordinate data

TODO: Reinstate this part of the docs once [GH issue #18](https://github.com/TomNicholas/VirtualiZarr/issues/18#issuecomment-2023955860) is properly closed.

### Automatic ordering using metadata
```{note}
For manual concatenation we can actually avoid creating any xarray indexes, as we won't need them. Without indexes we can avoid loading any data whatsoever from the files. However, you should first be confident that the archival files actually do have compatible data, as the coordinate values then cannot be efficiently compared for consistency (i.e. aligned).
TODO: Use preprocess to create a new index from the metadata
By default, indexes are created for 1-dimensional ``loadable_variables`` whose names match their only dimension (i.e. "dimension coordinates"), but if you wish you can load variables without creating any indexes by passing ``indexes={}`` to ``open_virtual_dataset``.
```
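
For example (a minimal sketch, reusing the file from above), you could load the `time` values without building an index for them:

```python
# load `time` into memory but skip creating any in-memory pandas indexes
vds = open_virtual_dataset('air1.nc', loadable_variables=['time'], indexes={})
```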

## Loading variables
### Ordering by coordinate values

Whilst the values of virtual variables (i.e. those backed by `ManifestArray` objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy/dask arrays, just like `xr.open_dataset` normally returns. These variables are specified using the `loadable_variables` argument:
If you're happy to load 1D dimension coordinates into memory, you can use their values to do the ordering for you!

```python
vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'], indexes={})
```
```python
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```
You can see that the dataset contains a mixture of virtual variables backed by `ManifestArray` objects (`lat` and `lon`), and loadable variables backed by (lazy) numpy arrays (`air` and `time`).

Loading variables can be useful in a few scenarios:
1. You need to look at the actual values of a multi-dimensional variable in order to decide what to do next,
2. Storing a variable on-disk as a set of references would be inefficient, e.g. because it's a very small array (saving the values like this is similar to kerchunk's concept of "inlining" data),
3. The variable has encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding.
vds1 = open_virtual_dataset('air1.nc', loadable_variables=['time', 'lat', 'lon'])
vds2 = open_virtual_dataset('air2.nc', loadable_variables=['time', 'lat', 'lon'])

### CF-encoded time variables
combined_vds = xr.combine_by_coords([vds2, vds1], coords='minimal', compat='override')
```

To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to True (`decode_times` defaults to None).
Notice we don't have to specify the concatenation dimension explicitly - xarray works out the correct ordering for us. Even though we actually passed in the virtual datasets in the wrong order just now, the manifest still has the chunks listed in the correct order such that the 1-dimensional ``time`` coordinate has ascending values:

```python
vds = open_virtual_dataset(
'air.nc',
loadable_variables=['air', 'time'],
decode_times=True,
indexes={},
)
combined_vds['air'].data.manifest.dict()
```
```python
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
time (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```
{'0.0.0': {'path': 'file:///work/data/air1.nc', 'offset': 15419, 'length': 3869000},
'1.0.0': {'path': 'file:///work/data/air2.nc', 'offset': 15419, 'length': 3869000}}
```
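
As a quick sanity check (a minimal sketch), the index created for the loaded `time` coordinate can be used to confirm the combined ordering:

```python
# the default index created for the loaded `time` coordinate is a pandas index,
# so we can check its ordering directly
assert combined_vds.indexes['time'].is_monotonic_increasing
```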

### Ordering using metadata

TODO: Use preprocess to create a new index from the metadata. Requires ``open_virtual_mfdataset`` to be implemented in [PR #349](https://github.com/zarr-developers/VirtualiZarr/pull/349).

## Writing virtual stores to disk

Once we've combined references to all the chunks of all our archival files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.
@@ -439,9 +451,9 @@ This store can however be read by {py:func}`~virtualizarr.open_virtual_dataset`,
You can open existing Kerchunk `json` or `parquet` references as Virtualizarr virtual datasets. This may be useful for converting existing Kerchunk formatted references to storage formats like [Icechunk](https://icechunk.io/).

```python
vds = open_virtual_dataset('combined.json', filetype='kerchunk', indexes={})
vds = open_virtual_dataset('combined.json', filetype='kerchunk')
# or
vds = open_virtual_dataset('combined.parquet', filetype='kerchunk', indexes={})
vds = open_virtual_dataset('combined.parquet', filetype='kerchunk')
```

One difference between the kerchunk references format and virtualizarr's internal manifest representation (as well as icechunk's format) is that paths in kerchunk references can be relative paths. Opening kerchunk references that contain relative local filepaths therefore requires supplying another piece of information: the directory of the ``fsspec`` filesystem which the filepath was defined relative to.
@@ -454,7 +466,6 @@ You can disambiguate kerchunk references containing relative paths by passing
vds = open_virtual_dataset(
'relative_refs.json',
filetype='kerchunk',
indexes={},
virtual_backend_kwargs={'fs_root': 'file:///some_directory/'}
)

13 changes: 6 additions & 7 deletions virtualizarr/readers/common.py
@@ -1,4 +1,3 @@
import warnings
from abc import ABC
from collections.abc import Iterable, Mapping, MutableMapping
from typing import (
@@ -55,13 +54,13 @@ def open_loadable_vars_and_indexes(
)

if indexes is None:
warnings.warn(
"Specifying `indexes=None` will create in-memory pandas indexes for each 1D coordinate, but concatenation of ManifestArrays backed by pandas indexes is not yet supported (see issue #18)."
"You almost certainly want to pass `indexes={}` to `open_virtual_dataset` instead."
)

# add default indexes by reading data from file
indexes = {name: index for name, index in ds.xindexes.items()}
# but avoid creating an in-memory index for virtual variables by default
indexes = {
name: index
for name, index in ds.xindexes.items()
if name in loadable_variables
}
elif indexes != {}:
# TODO allow manual specification of index objects
raise NotImplementedError()
16 changes: 12 additions & 4 deletions virtualizarr/tests/test_backend.py
@@ -89,13 +89,21 @@ def test_no_indexes(self, netcdf4_file, hdf_backend):
vds = open_virtual_dataset(netcdf4_file, indexes={}, backend=hdf_backend)
assert vds.indexes == {}

def test_create_default_indexes(self, netcdf4_file, hdf_backend):
with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
vds = open_virtual_dataset(netcdf4_file, indexes=None, backend=hdf_backend)
def test_create_default_indexes_for_loadable_variables(
self, netcdf4_file, hdf_backend
):
loadable_variables = ["time", "lat"]

vds = open_virtual_dataset(
netcdf4_file,
indexes=None,
backend=hdf_backend,
loadable_variables=loadable_variables,
)
ds = open_dataset(netcdf4_file, decode_times=True)

# TODO use xr.testing.assert_identical(vds.indexes, ds.indexes) instead once class supported by assertion comparison, see https://github.com/pydata/xarray/issues/5812
assert index_mappings_equal(vds.xindexes, ds.xindexes)
assert index_mappings_equal(vds.xindexes, ds[loadable_variables].xindexes)


def index_mappings_equal(indexes1: Mapping[str, Index], indexes2: Mapping[str, Index]):
20 changes: 10 additions & 10 deletions virtualizarr/tests/test_xarray.py
@@ -6,7 +6,7 @@

from virtualizarr import open_virtual_dataset
from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.readers.hdf import HDFVirtualBackend
from virtualizarr.readers import HDF5VirtualBackend, HDFVirtualBackend
from virtualizarr.tests import requires_kerchunk
from virtualizarr.zarr import ZArray

@@ -227,15 +227,17 @@ def test_concat_dim_coords_along_existing_dim(self):


@requires_kerchunk
@pytest.mark.parametrize("hdf_backend", [None, HDFVirtualBackend])
@pytest.mark.parametrize("hdf_backend", [HDF5VirtualBackend, HDFVirtualBackend])
class TestCombineUsingIndexes:
def test_combine_by_coords(self, netcdf4_files_factory: Callable, hdf_backend):
filepath1, filepath2 = netcdf4_files_factory()

with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
vds1 = open_virtual_dataset(filepath1, backend=hdf_backend)
with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
vds2 = open_virtual_dataset(filepath2, backend=hdf_backend)
vds1 = open_virtual_dataset(
filepath1, backend=hdf_backend, loadable_variables=["time", "lat", "lon"]
)
vds2 = open_virtual_dataset(
filepath2, backend=hdf_backend, loadable_variables=["time", "lat", "lon"]
)

combined_vds = xr.combine_by_coords(
[vds2, vds1],
@@ -247,10 +249,8 @@ def test_combine_by_coords(self, netcdf4_files_factory: Callable, hdf_backend):
def test_combine_by_coords_keeping_manifestarrays(self, netcdf4_files, hdf_backend):
filepath1, filepath2 = netcdf4_files

with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
vds1 = open_virtual_dataset(filepath1, backend=hdf_backend)
with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
vds2 = open_virtual_dataset(filepath2, backend=hdf_backend)
vds1 = open_virtual_dataset(filepath1, backend=hdf_backend)
vds2 = open_virtual_dataset(filepath2, backend=hdf_backend)

combined_vds = xr.combine_by_coords(
[vds2, vds1],
