Metadata rework #33

Merged · 16 commits · Aug 22, 2024
63 changes: 45 additions & 18 deletions README.md
@@ -136,43 +136,70 @@ $ ome2024-ngff-challenge --input-bucket=bia-integrator-data --input-endpoint=htt

</details>

-## Converting your data
+## CLI Commands

### `resave`: convert your data

The `ome2024-ngff-challenge` tool can be used to convert an OME-Zarr 0.4 dataset
that is based on Zarr v2 into a Zarr v3 dataset. The input data will **not be
modified** in any way and a full copy of the data will be created at the chosen
location.

-### Getting started
+#### Getting started

```
-ome2024-ngff-challenge input.zarr output.zarr
+ome2024-ngff-challenge resave --cc-by input.zarr output.zarr
```

This is the most basic invocation of the tool. If you do not choose a license,
the application will fail with:

```
No license set. Choose one of the Creative Commons license (e.g., `--cc-by`) or skip RO-Crate creation (`--rocrate-skip`)
```

#### Licenses

There are a number of license options to choose from. We suggest one of:

- `--cc-by`: credit must be given to the creator
- `--cc0`: add your data to the public domain

Alternatively, you can choose your own license, e.g.,

`--rocrate-license=https://creativecommons.org/licenses/by-nc/4.0/`

to restrict commercial use of your data. Additionally, you can disable metadata
collection entirely (`--rocrate-skip`).

**Note:** you will need to add metadata later for your dataset to be considered
valid.
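
For example, these options can be passed to the same `resave` call shown above;
`input.zarr` and `output.zarr` are placeholder paths:

```
# Dedicate the output to the public domain
ome2024-ngff-challenge resave --cc0 input.zarr output.zarr

# Point the RO-Crate metadata at a custom license URL
ome2024-ngff-challenge resave \
    --rocrate-license=https://creativecommons.org/licenses/by-nc/4.0/ \
    input.zarr output.zarr

# Skip RO-Crate creation entirely (add the metadata later)
ome2024-ngff-challenge resave --rocrate-skip input.zarr output.zarr
```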

#### Re-running the script

-is the most basic invocation of the tool. If you would like to re-run the script
-with different parameters, you can additionally set `--output-overwrite` to
-ignore a previous conversion:
+If you would like to re-run the script with different parameters, you can
+additionally set `--output-overwrite` to ignore a previous conversion:

```
-ome2024-ngff-challenge input.zarr output.zarr --output-overwrite
+ome2024-ngff-challenge resave --cc-by input.zarr output.zarr --output-overwrite
```

-### Writing in parallel
+#### Writing in parallel

By default, 16 chunks of data will be processed simultaneously in order to bound
memory usage. You can increase this number based on your local resources:

```
-ome2024-ngff-challenge input.zarr output.zarr --output-threads=128
+ome2024-ngff-challenge resave --cc-by input.zarr output.zarr --output-threads=128
```

-### Reading/writing remotely
+#### Reading/writing remotely

If you would like to avoid downloading and/or uploading the Zarr datasets, you
can set S3 parameters on the command line, which will then treat the input
and/or output datasets as prefixes within an S3 bucket:

```
-ome2024-ngff-challenge \
+ome2024-ngff-challenge resave --cc-by \
--input-bucket=BUCKET \
--input-endpoint=HOST \
--input-anon \
@@ -183,15 +210,15 @@ ome2024-ngff-challenge \
A small example you can try yourself:

```
-ome2024-ngff-challenge \
+ome2024-ngff-challenge resave --cc-by \
--input-bucket=idr \
--input-endpoint=https://uk1s3.embassy.ebi.ac.uk \
--input-anon \
zarr/v0.4/idr0062A/6001240.zarr \
/tmp/6001240.zarr
```
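
Writing the output remotely should work the same way in principle. Assuming the
output side mirrors the input flags shown above (`--output-bucket` and
`--output-endpoint` are an assumption here; check `ome2024-ngff-challenge resave --help`
for the exact names), an upload might look like:

```
# Hypothetical: read locally, write directly into an S3 bucket
ome2024-ngff-challenge resave --cc-by \
    --output-bucket=BUCKET \
    --output-endpoint=HOST \
    input.zarr \
    prefix/within/bucket/output.zarr
```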

-### Reading/writing via a script
+#### Reading/writing via a script

Another R/W option is to have `resave.py` generate a script which you can
execute later. If you pass `--output-script`, then rather than generate the
@@ -201,7 +228,7 @@ executed later.
For example, running:

```
-ome2024-ngff-challenge dev2/input.zarr /tmp/scripts.zarr --output-script
+ome2024-ngff-challenge resave --cc-by dev2/input.zarr /tmp/scripts.zarr --output-script
```

produces a dataset with one `zarr.json` file and 3 `convert.sh` scripts:
@@ -225,22 +252,22 @@ cargo install zarrs_tools
export PATH=$PATH:$HOME/.cargo/bin
```
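
To execute all of the generated scripts in one go, something along these lines
works (a sketch; adjust the path to wherever the scripts were written, and make
sure `zarrs_tools` is on your `PATH` as shown above):

```
find /tmp/scripts.zarr -name convert.sh -print0 | xargs -0 -n 1 sh
```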

-### Optimizing chunks and shards
+#### Optimizing chunks and shards

Finally, there is not yet a single heuristic for determining the chunk and shard
sizes that will work for all data. Pass the `--output-chunks` and
`--output-shards` flags in order to set the size of chunks and shards for all
resolutions:

```
-ome2024-ngff-challenge input.zarr output.zarr --output-chunks=1,1,1,256,256 --output-shards=1,1,1,2048,2048
+ome2024-ngff-challenge resave --cc-by input.zarr output.zarr --output-chunks=1,1,1,256,256 --output-shards=1,1,1,2048,2048
```

Alternatively, you can use a JSON file to review and manually optimize the
chunking and sharding parameters on a per-resolution basis:

```
-ome2024-ngff-challenge input.zarr parameters.json --output-write-details
+ome2024-ngff-challenge resave --cc-by input.zarr parameters.json --output-write-details
```

This will write a JSON file of the form:
@@ -254,7 +281,7 @@ the "multiscales". Edits to this file can be read back in using the
`output-read-details` flag:

```
-ome2024-ngff-challenge input.zarr output.zarr --output-read-details=parameters.json
+ome2024-ngff-challenge resave --cc-by input.zarr output.zarr --output-read-details=parameters.json
```

Note: Changes to the shape are ignored.
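
Putting the two steps together, a typical tuning loop looks like this (the
paths are placeholders):

```
# 1. Write the proposed per-resolution parameters to a JSON file
ome2024-ngff-challenge resave --cc-by input.zarr parameters.json --output-write-details

# 2. Edit the chunk and shard entries in parameters.json by hand

# 3. Convert using the edited parameters
ome2024-ngff-challenge resave --cc-by input.zarr output.zarr --output-read-details=parameters.json
```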
2 changes: 1 addition & 1 deletion dev2/resave.py
@@ -329,7 +329,7 @@ def write_rocrate(write_store):
properties={
"name": "Light microscopy photo of a fly",
"description": "Light microscopy photo of a fruit fly.",
"licence": "https://creativecommons.org/licenses/by/4.0/",
"license": "https://creativecommons.org/licenses/by/4.0/",
},
)
biosample = crate.add(
@@ -29,7 +29,7 @@
"@type": "Dataset",
"name": "Light microscopy photo of a fly",
"description": "Light microscopy photo of a fruit fly.",
"licence": "https://creativecommons.org/licenses/by/4.0/",
"license": "https://creativecommons.org/licenses/by/4.0/",
"hasPart": {
"@id": "./dros-mel-image.zarr/"
}
2 changes: 1 addition & 1 deletion dev3/2024-07-02/example-metadata/minimal.json
@@ -28,7 +28,7 @@
"@type": "Dataset",
"name": "Light microscopy photo of a fly",
"description": "Light microscopy photo of a fruit fly.",
"licence": "https://creativecommons.org/licenses/by/4.0/",
"license": "https://creativecommons.org/licenses/by/4.0/",
"hasPart": {
"@id": "./dros-mel-image.zarr/"
}
4 changes: 2 additions & 2 deletions dev3/2024-07-02/ro-crate-metadata-proposal.md
@@ -53,7 +53,7 @@ imagining technique.
"@type": "Dataset",
"name": "Light microscopy photo of a fly",
"description": "Light microscopy photo of a fruit fly.",
"licence": "https://creativecommons.org/licenses/by/4.0/",
"license": "https://creativecommons.org/licenses/by/4.0/",
"hasPart": {
"@id": "./dros-mel-image.zarr/"
}
@@ -138,7 +138,7 @@ The metadata json file would look like:
"@type": "Dataset",
"name": "Light microscopy photo of a fly",
"description": "Light microscopy photo of a fruit fly.",
"licence": "https://creativecommons.org/licenses/by/4.0/",
"license": "https://creativecommons.org/licenses/by/4.0/",
"hasPart": {
"@id": "./dros-mel-image.zarr/"
}
@@ -11,7 +11,7 @@
properties={
"name": "Light microscopy photo of a fly",
"description": "Light microscopy photo of a fruit fly.",
"licence": "https://creativecommons.org/licenses/by/4.0/",
"license": "https://creativecommons.org/licenses/by/4.0/",
},
)
biosample = crate.add(
@@ -23,7 +23,7 @@
"@type": "Dataset",
"name": "Light microscopy photo of a fly",
"description": "Light microscopy photo of a fruit fly.",
"licence": "https://creativecommons.org/licenses/by/4.0/",
"license": "https://creativecommons.org/licenses/by/4.0/",
"resultOf": {
"@id": "#16e30b5b-9995-4ff2-97e6-66a9c025f0d3"
}
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -65,7 +65,7 @@ enable = true
substitution.files = ["src/ome2024_ngff_challenge/__init__.py"]

[tool.poetry.scripts]
ome2024-ngff-challenge = "ome2024_ngff_challenge.resave:cli"
ome2024-ngff-challenge = "ome2024_ngff_challenge:dispatch"

[tool.setuptools_scm]
write_to = "src/ome2024_ngff_challenge/_version.py"
25 changes: 25 additions & 0 deletions src/ome2024_ngff_challenge/__init__.py
@@ -6,6 +6,31 @@

from __future__ import annotations

import argparse
import sys

from .lookup import cli as lookup_cli
from .resave import cli as resave_cli

__version__ = "0.0.0"

__all__ = ["__version__"]


def dispatch(args=sys.argv[1:]):
    """
    Parses the arguments contained in `args` and invokes the selected
    subcommand's `main` function. For `resave`, this returns the number
    of converted images and raises SystemExit if none were converted.
    """
    parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
    subparsers = parser.add_subparsers(help="subparser help")
    resave_cli(subparsers)
    lookup_cli(subparsers)
    # Upcoming parsers to be moved to submodules
    subparsers.add_parser("validate", help="TBD: evaluate a converted fileset locally")
    subparsers.add_parser(
        "update", help="TBD: update the RO-Crate metadata in a fileset"
    )
    ns = parser.parse_args(args)
    return ns.func(ns)
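
With the console script now pointing at `dispatch` (see the `pyproject.toml`
change above), every invocation starts with a subcommand, for example:

```
ome2024-ngff-challenge resave --cc-by input.zarr output.zarr
ome2024-ngff-challenge lookup "light-sheet"
```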
67 changes: 67 additions & 0 deletions src/ome2024_ngff_challenge/lookup.py
@@ -0,0 +1,67 @@
from __future__ import annotations

import argparse
import logging

import requests

from .utils import configure_logging

LOGGER = logging.getLogger(__file__)


def cli(subparsers: argparse._SubParsersAction):
    cmd = "ome2024-ngff-challenge lookup"
    desc = f"""


The `lookup` subcommand will search the EBI OLS service
for metadata identifiers matching the given input.


BASIC

Simplest example: {cmd} "light-sheet"


"""
    parser = subparsers.add_parser(
        "lookup",
        help="lookup metadata from EBI OLS",
        description=desc,
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.set_defaults(func=main)
    parser.add_argument(
        "--log", default="info", help="'error', 'warn', 'info', 'debug' or 'trace'"
    )
    parser.add_argument("text")


def parse(ns: argparse.Namespace):
    """
    Parse the namespace arguments provided by the dispatcher
    """

    configure_logging(ns, LOGGER)


def main(ns: argparse.Namespace):
    text = ns.text
    url = f"https://www.ebi.ac.uk/ols4/api/search?q={text}&obsoletes=false&local=false&rows=10&start=0&format=json&lang=en"
    response = requests.get(url, timeout=(5, 30))
    if response.status_code == 200:
        result = response.json()
        docs = result["response"]["docs"]
        header = "ONTOLOGY \tTERM \tLABEL \tDESCRIPTION"
        print(header)  # noqa: T201
        for doc in docs:
            onto = doc["ontology_name"]
            term = doc["short_form"]
            name = doc["label"]
            desc = "" if not doc["description"] else doc["description"][0]
            desc = desc.split("\n")[0][:70]  # At most first 70 chars of first line
            print(f"""{onto:10s}\t{term:20s}\t{name:30s}\t{desc}""")  # noqa: T201

    else:
        raise Exception(response)
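
For reference, `lookup` is a thin wrapper around the OLS search endpoint; the
same query that the code above issues for the "light-sheet" example can be
reproduced with `curl`:

```
curl "https://www.ebi.ac.uk/ols4/api/search?q=light-sheet&obsoletes=false&local=false&rows=10&start=0&format=json&lang=en"
```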