Remove warning when passing indexes=None (#357)
* remove warning

* fix test by loading dimension coordinates

* fix other test by passing loadable_variables

* move loadable variables docs section before combining

* remove recommendation to not create indexes

* de-emphasise avoiding creating indexes

* document using xr.combine_by_coords

* clarify todo

* signpost segue

* remove extra line

* release notes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* correct some PR numbers in release notes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refer to #18 in release notes

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
TomNicholas and pre-commit-ci[bot] authored Dec 18, 2024
1 parent 1474f18 commit 3188ca0
Showing 6 changed files with 119 additions and 98 deletions.
1 change: 0 additions & 1 deletion docs/api.rst
@@ -20,7 +20,6 @@ Reading

open_virtual_dataset


Serialization
-------------

8 changes: 6 additions & 2 deletions docs/releases.rst
@@ -14,7 +14,11 @@ Breaking changes

- Passing ``group=None`` (the default) to ``open_virtual_dataset`` for a file with multiple groups no longer raises an error, instead it gives you the root group.
This new behaviour is more consistent with ``xarray.open_dataset``.
(:issue:`336`, :pull:`337`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
(:issue:`336`, :pull:`338`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Indexes are now created by default for any loadable one-dimensional coordinate variables.
Also, a warning is no longer raised when ``indexes=None`` is passed to ``open_virtual_dataset``, and the recommendations in the docs have been updated to match.
This also means that ``xarray.combine_by_coords`` will now work when the necessary dimension coordinates are specified in ``loadable_variables``.
(:issue:`18`, :pull:`357`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

Deprecations
~~~~~~~~~~~~
@@ -23,7 +27,7 @@ Bug fixes
~~~~~~~~~

- Fix bug preventing generating references for the root group of a file when a subgroup exists.
(:issue:`336`, :pull:`337`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
(:issue:`336`, :pull:`338`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

Documentation
~~~~~~~~~~~~~
159 changes: 85 additions & 74 deletions docs/usage.md
@@ -184,6 +184,68 @@ The full Zarr model (for a single group) includes multiple arrays, array names,

The problem of combining many archival format files (e.g. netCDF files) into one virtual Zarr store therefore becomes just a matter of opening each file using `open_virtual_dataset` and using [xarray's various combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html) to combine them into one aggregate virtual dataset.

But before we combine our data, we might want to consider loading some variables into memory.

## Loading variables

Whilst the values of virtual variables (i.e. those backed by `ManifestArray` objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy/dask arrays, just like `xr.open_dataset` normally returns. These variables are specified using the `loadable_variables` argument:

```python
vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'], indexes={})
```
```python
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```
You can see that the dataset contains a mixture of virtual variables backed by `ManifestArray` objects (`lat` and `lon`), and loadable variables backed by (lazy) numpy arrays (`air` and `time`).
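
If you want to check this programmatically, one option (a minimal sketch, reusing the `vds` opened above) is to test whether each variable's underlying data is a `ManifestArray`:

```python
from virtualizarr.manifests import ManifestArray

# variables still represented purely by chunk references (virtual)
virtual_vars = [
    name for name, var in vds.variables.items()
    if isinstance(var.data, ManifestArray)
]
# variables whose values were loaded via `loadable_variables`
loaded_vars = [
    name for name, var in vds.variables.items()
    if not isinstance(var.data, ManifestArray)
]

print(virtual_vars)  # e.g. ['lat', 'lon']
print(loaded_vars)   # e.g. ['air', 'time']
```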

Loading variables can be useful in a few scenarios:
1. You need to look at the actual values of a multi-dimensional variable in order to decide what to do next,
2. You want in-memory indexes to use with ``xr.combine_by_coords``,
3. Storing a variable on-disk as a set of references would be inefficient, e.g. because it's a very small array (saving the values like this is similar to kerchunk's concept of "inlining" data),
4. The variable has encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding.

### CF-encoded time variables

To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to True (`decode_times` defaults to None).

```python
vds = open_virtual_dataset(
'air.nc',
loadable_variables=['air', 'time'],
decode_times=True,
indexes={},
)
```
```python
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
time (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```

## Combining virtual datasets

In general we should be able to combine all the datasets from our archival files into one using some combination of calls to `xarray.concat` and `xarray.merge`. For combining along multiple dimensions in one call we also have `xarray.combine_nested` and `xarray.combine_by_coords`. If you're not familiar with any of these functions we recommend you skim through [xarray's docs on combining](https://docs.xarray.dev/en/stable/user-guide/combining.html).
@@ -206,23 +268,9 @@ TODO: Note about variable-length chunking?

The simplest case of concatenation is when you have a set of files and you know which order they should be concatenated in, _without looking inside the files_. In this case it is sufficient to open the files one-by-one, then pass the virtual datasets as a list to the concatenation function.

We can actually avoid creating any xarray indexes, as we won't need them. Without indexes we can avoid loading any data whatsoever from the files, making our opening and combining much faster than it normally would be. **Therefore if you can do your combining manually you should.** However, you should first be confident that the archival files actually do have compatible data, as only the array shapes and dimension names will be checked for consistency.

You can specify that you don't want any indexes to be created by passing `indexes={}` to `open_virtual_dataset`.

```python
vds1 = open_virtual_dataset('air1.nc', indexes={})
vds2 = open_virtual_dataset('air2.nc', indexes={})
```

We can see that the datasets have no indexes.

```python
vds1.indexes
```
```
Indexes:
*empty*
vds1 = open_virtual_dataset('air1.nc')
vds2 = open_virtual_dataset('air2.nc')
```

As we know the correct order a priori, we can just combine along one dimension using `xarray.concat`.
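
A minimal sketch of what such a call might look like (the specific keyword arguments below are illustrative assumptions, not the only valid choice):

```python
import xarray as xr

# concatenate along the dimension we already know is correct;
# the coords/compat/combine_attrs choices shown here are illustrative
combined_vds = xr.concat(
    [vds1, vds2],
    dim='time',
    coords='minimal',
    compat='override',
    combine_attrs='override',
)
```
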
@@ -285,73 +333,37 @@ In future we would like for it to be possible to just use `xr.open_mfdataset` to open the files and combine them in one go, e.g.
but this requires some [upstream changes](https://github.com/TomNicholas/VirtualiZarr/issues/35) in xarray.
```

### Automatic ordering using coordinate data

TODO: Reinstate this part of the docs once [GH issue #18](https://github.com/TomNicholas/VirtualiZarr/issues/18#issuecomment-2023955860) is properly closed.

### Automatic ordering using metadata
```{note}
For manual concatenation we can actually avoid creating any xarray indexes, as we won't need them. Without indexes we can avoid loading any data whatsoever from the files. However, you should first be confident that the archival files actually do have compatible data, as the coordinate values then cannot be efficiently compared for consistency (i.e. aligned).
TODO: Use preprocess to create a new index from the metadata
By default, indexes are created for 1-dimensional ``loadable_variables`` whose names match their only dimension (i.e. "dimension coordinates"), but if you wish you can load variables without creating any indexes by passing ``indexes={}`` to ``open_virtual_dataset``.
```
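
For example (a minimal sketch, reusing the file from above), you could load the `time` values without building an index for them:

```python
# load `time` into memory but skip creating any in-memory pandas indexes
vds = open_virtual_dataset('air1.nc', loadable_variables=['time'], indexes={})
```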

## Loading variables
### Ordering by coordinate values

Whilst the values of virtual variables (i.e. those backed by `ManifestArray` objects) cannot be loaded into memory, you do have the option of opening specific variables from the file as loadable lazy numpy/dask arrays, just like `xr.open_dataset` normally returns. These variables are specified using the `loadable_variables` argument:
If you're happy to load 1D dimension coordinates into memory, you can use their values to do the ordering for you!

```python
vds = open_virtual_dataset('air.nc', loadable_variables=['air', 'time'], indexes={})
```
```python
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```
You can see that the dataset contains a mixture of virtual variables backed by `ManifestArray` objects (`lat` and `lon`), and loadable variables backed by (lazy) numpy arrays (`air` and `time`).

Loading variables can be useful in a few scenarios:
1. You need to look at the actual values of a multi-dimensional variable in order to decide what to do next,
2. Storing a variable on-disk as a set of references would be inefficient, e.g. because it's a very small array (saving the values like this is similar to kerchunk's concept of "inlining" data),
3. The variable has encoding, and the simplest way to decode it correctly is to let xarray's standard decoding machinery load it into memory and apply the decoding.
vds1 = open_virtual_dataset('air1.nc', loadable_variables=['time', 'lat', 'lon'])
vds2 = open_virtual_dataset('air2.nc', loadable_variables=['time', 'lat', 'lon'])

### CF-encoded time variables
combined_vds = xr.combine_by_coords([vds2, vds1], coords='minimal', compat='override')
```

To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to True (`decode_times` defaults to None).
Notice we don't have to specify the concatenation dimension explicitly - xarray works out the correct ordering for us. Even though we actually passed in the virtual datasets in the wrong order just now, the manifest still has the chunks listed in the correct order such that the 1-dimensional ``time`` coordinate has ascending values:

```python
vds = open_virtual_dataset(
'air.nc',
loadable_variables=['air', 'time'],
decode_times=True,
indexes={},
)
combined_vds['air'].data.manifest.dict()
```
```python
<xarray.Dataset> Size: 31MB
Dimensions: (time: 2920, lat: 25, lon: 53)
Coordinates:
lat (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chu...
lon (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chu...
time (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
Data variables:
air (time, lat, lon) float64 31MB ...
Attributes:
Conventions: COARDS
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
title: 4x daily NMC reanalysis (1948)
```
{'0.0.0': {'path': 'file:///work/data/air1.nc', 'offset': 15419, 'length': 3869000},
'1.0.0': {'path': 'file:///work/data/air2.nc', 'offset': 15419, 'length': 3869000}}
```
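
As a quick sanity check (a minimal sketch), the index created for the loaded `time` coordinate can be used to confirm the combined ordering:

```python
# the default index created for the loaded `time` coordinate is a pandas index,
# so we can check its ordering directly
assert combined_vds.indexes['time'].is_monotonic_increasing
```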

### Ordering using metadata

TODO: Use preprocess to create a new index from the metadata. Requires ``open_virtual_mfdataset`` to be implemented in [PR #349](https://github.com/zarr-developers/VirtualiZarr/pull/349).

## Writing virtual stores to disk

Once we've combined references to all the chunks of all our archival files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.
@@ -439,9 +451,9 @@ This store can however be read by {py:func}`~virtualizarr.open_virtual_dataset`,
You can open existing Kerchunk `json` or `parquet` references as Virtualizarr virtual datasets. This may be useful for converting existing Kerchunk formatted references to storage formats like [Icechunk](https://icechunk.io/).

```python
vds = open_virtual_dataset('combined.json', filetype='kerchunk', indexes={})
vds = open_virtual_dataset('combined.json', filetype='kerchunk')
# or
vds = open_virtual_dataset('combined.parquet', filetype='kerchunk', indexes={})
vds = open_virtual_dataset('combined.parquet', filetype='kerchunk')
```

One difference between the kerchunk references format and virtualizarr's internal manifest representation (as well as icechunk's format) is that paths in kerchunk references can be relative paths. Opening kerchunk references that contain relative local filepaths therefore requires supplying another piece of information: the directory of the ``fsspec`` filesystem which the filepath was defined relative to.
@@ -454,7 +466,6 @@ You can disambiguate kerchunk references containing relative paths by passing
vds = open_virtual_dataset(
'relative_refs.json',
filetype='kerchunk',
indexes={},
virtual_backend_kwargs={'fs_root': 'file:///some_directory/'}
)

13 changes: 6 additions & 7 deletions virtualizarr/readers/common.py
@@ -1,4 +1,3 @@
import warnings
from abc import ABC
from collections.abc import Iterable, Mapping, MutableMapping
from typing import (
@@ -55,13 +54,13 @@ def open_loadable_vars_and_indexes(
)

if indexes is None:
warnings.warn(
"Specifying `indexes=None` will create in-memory pandas indexes for each 1D coordinate, but concatenation of ManifestArrays backed by pandas indexes is not yet supported (see issue #18)."
"You almost certainly want to pass `indexes={}` to `open_virtual_dataset` instead."
)

# add default indexes by reading data from file
indexes = {name: index for name, index in ds.xindexes.items()}
# but avoid creating an in-memory index for virtual variables by default
indexes = {
name: index
for name, index in ds.xindexes.items()
if name in loadable_variables
}
elif indexes != {}:
# TODO allow manual specification of index objects
raise NotImplementedError()
16 changes: 12 additions & 4 deletions virtualizarr/tests/test_backend.py
@@ -89,13 +89,21 @@ def test_no_indexes(self, netcdf4_file, hdf_backend):
vds = open_virtual_dataset(netcdf4_file, indexes={}, backend=hdf_backend)
assert vds.indexes == {}

def test_create_default_indexes(self, netcdf4_file, hdf_backend):
with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
vds = open_virtual_dataset(netcdf4_file, indexes=None, backend=hdf_backend)
def test_create_default_indexes_for_loadable_variables(
self, netcdf4_file, hdf_backend
):
loadable_variables = ["time", "lat"]

vds = open_virtual_dataset(
netcdf4_file,
indexes=None,
backend=hdf_backend,
loadable_variables=loadable_variables,
)
ds = open_dataset(netcdf4_file, decode_times=True)

# TODO use xr.testing.assert_identical(vds.indexes, ds.indexes) instead once class supported by assertion comparison, see https://github.com/pydata/xarray/issues/5812
assert index_mappings_equal(vds.xindexes, ds.xindexes)
assert index_mappings_equal(vds.xindexes, ds[loadable_variables].xindexes)


def index_mappings_equal(indexes1: Mapping[str, Index], indexes2: Mapping[str, Index]):
20 changes: 10 additions & 10 deletions virtualizarr/tests/test_xarray.py
@@ -6,7 +6,7 @@

from virtualizarr import open_virtual_dataset
from virtualizarr.manifests import ChunkManifest, ManifestArray
from virtualizarr.readers.hdf import HDFVirtualBackend
from virtualizarr.readers import HDF5VirtualBackend, HDFVirtualBackend
from virtualizarr.tests import requires_kerchunk
from virtualizarr.zarr import ZArray

@@ -227,15 +227,17 @@ def test_concat_dim_coords_along_existing_dim(self):


@requires_kerchunk
@pytest.mark.parametrize("hdf_backend", [None, HDFVirtualBackend])
@pytest.mark.parametrize("hdf_backend", [HDF5VirtualBackend, HDFVirtualBackend])
class TestCombineUsingIndexes:
def test_combine_by_coords(self, netcdf4_files_factory: Callable, hdf_backend):
filepath1, filepath2 = netcdf4_files_factory()

with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
vds1 = open_virtual_dataset(filepath1, backend=hdf_backend)
with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
vds2 = open_virtual_dataset(filepath2, backend=hdf_backend)
vds1 = open_virtual_dataset(
filepath1, backend=hdf_backend, loadable_variables=["time", "lat", "lon"]
)
vds2 = open_virtual_dataset(
filepath2, backend=hdf_backend, loadable_variables=["time", "lat", "lon"]
)

combined_vds = xr.combine_by_coords(
[vds2, vds1],
@@ -247,10 +249,8 @@ def test_combine_by_coords(self, netcdf4_files_factory: Callable, hdf_backend):
def test_combine_by_coords_keeping_manifestarrays(self, netcdf4_files, hdf_backend):
filepath1, filepath2 = netcdf4_files

with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
vds1 = open_virtual_dataset(filepath1, backend=hdf_backend)
with pytest.warns(UserWarning, match="will create in-memory pandas indexes"):
vds2 = open_virtual_dataset(filepath2, backend=hdf_backend)
vds1 = open_virtual_dataset(filepath1, backend=hdf_backend)
vds2 = open_virtual_dataset(filepath2, backend=hdf_backend)

combined_vds = xr.combine_by_coords(
[vds2, vds1],
