
Feature/nd2reader-xarray-ome-metadata #533

Merged

Conversation

@yichechang (Contributor) commented Sep 21, 2023

Description

Include parsed metadata (OME) as attrs of returned xarray DataArray when using nd2_reader.

The OME metadata has been accessible from nd2_reader since 4.12.1 (specifically since PR #521), but it is currently not included in the xarray interface.
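
For context, here is a minimal sketch of the access pattern this enables (the file name below is hypothetical, and the attrs keys follow the processed/unprocessed convention discussed later in this thread):

```python
from aicsimageio.readers.nd2_reader import ND2Reader

reader = ND2Reader("example.nd2")  # hypothetical input file
xarr = reader.xarray_data

# nd2-specific metadata dict, already attached before this change
raw = xarr.attrs["unprocessed"]

# with this change, the parsed OME metadata (an ome_types.OME object)
# also travels with the returned DataArray
ome = xarr.attrs["processed"]
```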

This is my first PR. While I tried to dig into the tests for Reader, AICSImage, and the extra readers, I couldn't tell whether this specific feature is covered or should be. I'm probably also missing a lot of things, and I'd be grateful if anyone could point me to some references and/or just say "read this!".

Also, if an issue/discussion is required before opening a PR, please let me know and I'll convert this to a draft.

Pull request recommendations:

  • Name your pull request your-development-type/short-description. Ex: feature/read-tiff-files
  • Link to any relevant issue in the PR description. Ex: Resolves [gh-], adds tiff file format support
  • Provide relevant tests for your feature or bug fix.
    • I didn't see where this should be tested, or whether it is tested at all for other readers, but most likely I missed something... I'll be happy to include tests if I can get some suggestions/guidance.
    • Updated existing tests to reflect changes in nd2 reader's metadata attribute. Now expecting OME (originally dict) if supported.
  • Provide or update documentation for any feature added by your pull request.
    • This probably is not needed for this PR, but please let me know if I'm wrong.

@toloudis (Collaborator) commented Sep 21, 2023

@yichechang Thank you for the contribution! It would definitely be nice if you could add a unit test to confirm that the exception handling works. (one "good" ome metadata and one "bad"? It may be the case that you need to catch more exception types...)

The place to put it is in https://github.com/AllenCellModeling/aicsimageio/blob/main/aicsimageio/tests/readers/extra_readers/test_nd2_reader.py
Ideally you can add a test that uses the pre-existing test files. If we need to add a test nd2 file it gets more complicated.
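
Something along these lines could work (the file names and the `resources_dir` fixture below are placeholders for however the existing tests locate their resources, not the actual test assets):

```python
import pytest
from ome_types import OME

from aicsimageio.readers.nd2_reader import ND2Reader


@pytest.mark.parametrize(
    "filename, expected_metadata_type",
    [
        ("good_ome.nd2", OME),  # placeholder: file whose OME metadata parses cleanly
        ("legacy.nd2", dict),   # placeholder: legacy file falling back to the nd2 dict
    ],
)
def test_nd2_reader_metadata_type(resources_dir, filename, expected_metadata_type):
    # resources_dir stands in for the fixture/constant the existing tests use
    reader = ND2Reader(resources_dir / filename)
    assert isinstance(reader.metadata, expected_metadata_type)
```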

@toloudis (Collaborator) commented

Actually, it looks like the nd2 unit test is already failing on this. The problem I'm seeing is that `expected_metadata_type` is `dict` for the test, and with this code change maybe that has changed?

@yichechang (Contributor, Author) commented Sep 21, 2023

Hi @toloudis - thanks for the reply and guidance!

I saw the failing tests for nd2_reader and at first glance they seem to be related to metadata types, suggesting they have something to do with the changes I made. I think I know what's happening, but I'll have to think about it more over the weekend, as it comes down to what behavior one wants for this reader. I have no idea why the sldy_reader tests failed, though.

@tlambert03, if I could also get your opinion on this, it would be very helpful:

  • I understand that prior to my changes, the unprocessed metadata returned by ND2Reader (via ND2Reader(...).metadata, as well as ND2Reader(...).xarray_data.attrs['unprocessed']) would be a dict of various classes defined in nd2.structures, and these are already parsed and useful.
  • However, aicsimageio.readers.Reader's metadata attribute currently favors xarray_data.attrs['processed'] (which I believe is meant to be of type OME) over xarray_data.attrs['unprocessed'].

This is probably why these tests fail now: 'processed' metadata becomes available for xarray/xarray_dask data, and thus ND2Reader.metadata no longer returns a dict of nd2-specific metadata structures. This is where I don't have a good intuition: what is the expected and/or intended behavior for the nd2 reader's default metadata?

If I missed anything, or if this feature goes against the nd2_reader plugin's design, please let me know, or tell me if I should close this PR. For my use case (having access to OME metadata in the returned xarray DataArray), I guess I could always just force the bioformats reader.

@yichechang (Contributor, Author) commented

A little bit of digging led to the following observations:

  • aicsimageio.readers.Reader has the following attributes

    • metadata, which looks, in order, for _metadata > xarray_dask_data.attrs['processed'] > the fallback xarray_dask_data.attrs['unprocessed']. The docstring seems to suggest that this is ideally a parsed/processed Python object if the underlying reader supports it, hence the processed/unprocessed distinction.
    • ome_metadata, which needs to be implemented by each specific reader and is clearly required to be OME.
  • aicsimageio.readers.Reader instance's xarray_dask_data can have

    • .attrs['unprocessed']
    • .attrs['processed']

But I cannot find a specification of unprocessed vs. processed. Presumably the most relevant documentation is around the aicsimageio.readers.Reader.metadata attribute, which suggests that parsed/useful metadata is retrieved from .xarray_dask_data.attrs['processed'], falling back to .xarray_dask_data.attrs['unprocessed'] if that is not supported.
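
In other words, the documented preference seems to boil down to something like the following (my paraphrase, not the actual aicsimageio source):

```python
class ReaderSketch:
    """Paraphrase of the documented Reader.metadata resolution order."""

    def __init__(self, xarray_dask_data, metadata=None):
        self._metadata = metadata
        self.xarray_dask_data = xarray_dask_data

    @property
    def metadata(self):
        if self._metadata is not None:        # reader-supplied metadata wins
            return self._metadata
        attrs = self.xarray_dask_data.attrs
        if "processed" in attrs:              # parsed/useful object, if supported
            return attrs["processed"]
        return attrs["unprocessed"]           # fallback: raw metadata
```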

So at the reader level, metadata and ome_metadata both exist, and currently metadata prefers a processed, useful Python object when supported (i.e., when .xarray_dask_data.attrs['processed'] is available). At the .xarray_dask_data level, there are also two options: unprocessed vs. processed. But really, it seems like there are three options: raw, processed, and processed OME. So choosing which two to provide at the reader level vs. the xarray_dask_data level requires a decision and some coordination?

It would be great if someone could provide some insight into the design of the API.

@evamaxfield (Collaborator) commented

Maybe I can provide some insight!

> aicsimageio.readers.Reader has the following attributes
>
> * `metadata`, which looks, in order, for `_metadata` > `xarray_dask_data.attrs['processed']` > the fallback `xarray_dask_data.attrs['unprocessed']`. The docstring seems to suggest that this is ideally a parsed/processed Python object if the underlying reader supports it, hence the processed/unprocessed distinction.
> * `ome_metadata`, which needs to be implemented by each specific reader and is clearly required to be OME.

This is entirely correct!!

> aicsimageio.readers.Reader instance's xarray_dask_data can have
>
> * `.attrs['unprocessed']`
> * `.attrs['processed']`
>
> But I cannot find a specification of unprocessed vs. processed. Presumably the most relevant documentation is around the aicsimageio.readers.Reader.metadata attribute, which suggests that parsed/useful metadata is retrieved from .xarray_dask_data.attrs['processed'], falling back to .xarray_dask_data.attrs['unprocessed'] if that is not supported.

This is also entirely correct! I don't think we ever fully defined what "good" or "bad" / "unprocessed" and "processed" metadata is. The reason is that some libraries may work best with a dictionary (i.e., metadata from JSON), while others may work best with an XML etree (i.e., metadata from XML). In general, we leave "unprocessed" as the raw metadata, almost in string form if need be, and "processed" as "whatever object is best for the file format".

This is also why we introduced ome_metadata: since the metadata object can be wildly different between formats, over time, hopefully, more and more file formats will get metadata converters to OME, because it is a nice, set standard.

We defer in the order "processed" -> "unprocessed", and "ome_metadata" is its own separate thing.

> It would be great if someone could provide some insight into the design of the API.

Metadata is hard and annoying so this part of the API has always sort of been up in the air imo.

@evamaxfield (Collaborator) commented

> Actually, it looks like the nd2 unit test is already failing on this. The problem I'm seeing is that `expected_metadata_type` is `dict` for the test, and with this code change maybe that has changed?

Really, all that is needed to make this PR "acceptable", imo, is that we update the test to reflect that a new metadata type is returned. The reader metadata code change is totally fine. Because of our chain of deferring on "which metadata is available", the test should simply be updated from dict -> OME.
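
Roughly this kind of change (illustrative only, not the exact test code):

```python
from ome_types import OME

# before: the nd2 reader tests expected the nd2-specific dict
expected_metadata_type = dict

# after: with attrs['processed'] now populated, metadata resolves to OME
expected_metadata_type = OME
```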

@evamaxfield (Collaborator) commented

I also just want to say: THANK YOU for this PR and for your deep dive into our API spec. This is something for us to get better at.

@tlambert03 (Collaborator) left a review comment


Yes, thank you! Everything I've read here suggests you've done a careful analysis. You're correct about what nd2 is pumping out of its processed/unprocessed metadata, and I don't really have anything to add. I agree with @evamaxfield that your code appears to be correct, and the test failures appear to be on the side of assumptions made in the aicsimageio tests. I'm also not entirely sure, so the team here is best suited to help with that. But I'm very happy with this change.

@yichechang (Contributor, Author) commented

Hey everyone!

Big thanks for all your feedback and for making and maintaining this amazing tool (I'm a huge fan!). And kudos for fostering such a friendly community.

> This is also why we introduced ome_metadata: since the metadata object can be wildly different between formats, over time, hopefully, more and more file formats will get metadata converters to OME, because it is a nice, set standard.
>
> We defer in the order "processed" -> "unprocessed", and "ome_metadata" is its own separate thing.

So, if I've got this right,

  1. the way ND2Reader.metadata is currently returning the nd2-specific metadata (not OME) aligns with the design
  2. but it’s a tad confusing, as the metadata isn’t coming back under ND2Reader.xarray_data.attrs['processed'] but instead under 'unprocessed'
  3. I’m also getting that I shouldn't look for OME in "processed," as it’s in a league of its own.

I’ve laid out a few possibilities (current behavior, this PR, and other solutions) and their implications below. I was all in for the changes in the current PR, but now I’m wondering if it’s really the best fit for aicsimageio overall. If you all think it’s a good PR, I will fix the failing tests soon, or within the month. But if it’s better left unmerged, or if there’s a better solution, I’m cool with that too.

Looking forward to your thoughts! 🙌


| | Current (no change) | Solution 1 (current PR) | Solution 2 | Solution 3 | Solution 4 |
| --- | --- | --- | --- | --- | --- |
| **Behavior** | | | | | |
| (1) `ND2Reader.metadata` | nd2-specific dict | OME | nd2-specific dict | nd2-specific dict | nd2-specific dict |
| (2) `ND2Reader.ome_metadata` | OME | OME | OME | OME | OME |
| (3) `ND2Reader.xarray_dask_data.attrs['unprocessed']` | nd2-specific dict | nd2-specific dict | No such key. Or dump raw metadata? | No such key. Or dump raw metadata? | nd2-specific dict |
| (4) `ND2Reader.xarray_dask_data.attrs['processed']` | No such key. | OME | nd2-specific dict | nd2-specific dict | nd2-specific dict |
| (5) `ND2Reader.xarray_dask_data.attrs['ome']` | No such key. | No such key. | No such key. | OME | OME |
| **Impact** | | | | | |
| 😥 | A little bit confusing when not getting OME as the xarray object’s “processed”. | Breaking users expecting nd2’s metadata structure to be returned via (1). (Still accessible with workaround.) | Breaking users expecting nd2’s metadata structure to be returned via (3). (Still accessible with workaround.) Doesn’t add convenience. | Breaking users expecting nd2’s metadata structure to be returned via (3). (Still accessible with workaround.) Introducing a new attrs key of returned xarray objects -> requires changes in other readers… | Introducing a new attrs key of returned xarray objects -> requires changes in other readers… Duplicated metadata in xarray objects. |
| 😃 | Nothing changes. | Consistently getting OME metadata with the DataArray returned by `AICSImage.xarray_data`, irrespective of reader (nd2 or bioformats). | Closest to the original design (i.e., what you would expect to get after reading the current docs). | Closest to the original design. Plus still getting OME automatically in xarray objects. | Also close to the original design. No breaking changes. |

@codecov-commenter commented Sep 24, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (1e3d0ad) 89.54% compared to head (ed62d8a) 89.55%.

Additional details and impacted files
```diff
@@           Coverage Diff           @@
##             main     #533   +/-   ##
=======================================
  Coverage   89.54%   89.55%           
=======================================
  Files          53       53           
  Lines        4658     4662    +4     
=======================================
+ Hits         4171     4175    +4     
  Misses        487      487           
```

| Files | Coverage Δ |
| --- | --- |
| aicsimageio/readers/nd2_reader.py | 90.56% <100.00%> (+0.77%) ⬆️ |
| ...eio/tests/readers/extra_readers/test_nd2_reader.py | 96.87% <ø> (ø) |


@yichechang (Contributor, Author) commented Sep 24, 2023

I've updated the tests to reflect the new behavior introduced in this PR: ND2Reader(...).metadata will return OME (if ND2Reader.ome_metadata returns it) or dict (if a legacy nd2 file is the input and .ome_metadata raises NotImplementedError).
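
Roughly, the reader-side behavior is now along these lines (a simplified sketch, not the exact diff; the attribute names on `reader` are placeholders):

```python
def build_attrs(reader) -> dict:
    """Simplified sketch of the fallback described above."""
    # placeholder name for the nd2-specific metadata dict
    attrs = {"unprocessed": reader.nd2_metadata_dict}
    try:
        # ome_types.OME, when the file and installed nd2 version support it
        attrs["processed"] = reader.ome_metadata
    except NotImplementedError:
        # legacy nd2 files: only the unprocessed dict is attached
        pass
    return attrs
```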

The tests do not cover the situation where nd2 < v0.7.0. I am not sure how to test against package versions, and if that needs to be tested, I think it might be best to include it in the tests for the .ome_metadata method? (Or both?)

@tlambert03 (Collaborator) commented

> The tests do not cover the situation where nd2 < v0.7.0. I am not sure how to test against package versions, and if that needs to be tested, I think it might be best to include it in the tests for the .ome_metadata method? (Or both?)

I don't think that needs to be tested here. This is a one-line hasattr test, and your catching of NotImplementedError is pretty fail-proof. From the perspective of aicsimageio, readers either return an instance of ome_types.OME from ome_metadata or they don't; all the rest of the testing should be over at https://github.com/tlambert03/nd2 itself.
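
For reference, the hasattr guard I mean is roughly of this shape (an assumption sketched from this thread, not the actual code; `ND2File.ome_metadata()` is assumed to be available only in newer nd2 releases):

```python
import nd2
from ome_types import OME


def ome_metadata_sketch(path: str) -> OME:
    """Assumed shape of the version guard; not the actual reader code."""
    with nd2.ND2File(path) as rdr:
        if hasattr(rdr, "ome_metadata"):  # present in newer nd2 releases
            return rdr.ome_metadata()
        raise NotImplementedError("installed nd2 version cannot produce OME metadata")
```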

For all the rest (i.e., what each reader is supposed to return from .metadata, 'unprocessed', 'processed', etc.), we'll need @toloudis, @evamaxfield, or others at aicsimageio to weigh in on what the API is supposed to be.

@yichechang (Contributor, Author) commented

Hi @tlambert03, thank you so much for the detailed explanation and patience! In hindsight, this is such a small PR and I am very grateful for receiving lots of useful feedback and help from you all.

@tlambert03 (Collaborator) commented

> such a small PR and I am very grateful for receiving lots of useful feedback and help from you all.

Small in terms of lines of code, perhaps, but the analysis you did in #533 (comment) is kind of epic and the sort of contribution that all maintainers dream of getting :)

@toloudis (Collaborator) commented

The key comment is here:

> If the inheriting Reader supports processing the metadata into a more useful

Basically, readers can do almost anything they want with PROCESSED and UNPROCESSED. The ome_metadata property is the only one whose output is required to be an OME object from ome-types.

The spirit of PROCESSED vs UNPROCESSED is this (@evamaxfield please correct me if I am remembering this wrong!):

UNPROCESSED: the native file format's metadata as read directly from the file. Can often be just a string or a dict.

PROCESSED: the native file format's metadata as a higher-level object. Up to the discretion of the implementor. Can be a dict, an XML tree, an ome-types object, etc. Whether metadata is filtered or modified is part of each reader's specific contract.
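
A toy illustration of that split (hypothetical JSON-backed format, not any real reader):

```python
import json

# UNPROCESSED: metadata as read straight from the (hypothetical) file
unprocessed = '{"channels": 2, "z_slices": 5, "objective": "60x"}'

# PROCESSED: whatever higher-level object is most useful for this format;
# here a plain dict, but another reader might choose an XML etree or an
# ome-types object instead
processed = json.loads(unprocessed)
```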

@evamaxfield (Collaborator) commented

> UNPROCESSED: the native file format's metadata as read directly from the file. Can often be just a string or a dict.
>
> PROCESSED: the native file format's metadata as a higher-level object. Up to the discretion of the implementer. Can be a dict, an XML tree, an ome-types object, etc. Whether metadata is filtered or modified is part of each reader's specific contract.

I think this is correct / aligns with my original intent.

@yichechang @toloudis Speaking directly to this PR, I think it is fine for the "PROCESSED" metadata to be the OME object. It is the "more useful / most useful" version of the metadata. In that case, the only change needed is that the test should be updated to expect an OME object.


> Small in terms of lines of code, perhaps, but the analysis you did in #533 (comment) is kind of epic and the sort of contribution that all maintainers dream of getting :)

Seconding this comment. When I saw your comment I was over the moon. It is not only helpful for the PR but also for knowing where our existing API is under-defined / hard to follow / poorly documented.

@yichechang (Contributor, Author) commented

Thank you all again for your kind words and insights : )

In the most recent checks, all nd2 tests have passed, I believe. The ones that failed are:

  • all sldy
  • all bfio
  • one omezarr which was canceled after almost 6 hours
  • one "lint"

Any idea if some of these might be directly related to this PR?

@yichechang (Contributor, Author) commented

I thought that the most recent PR #529 merged to main might have something to do with the test failures here (as mentioned above, for sldy and bfio), so I tried to sync my fork & branch directly on GitHub. But I have a feeling that I did something wrong...

Now I have a merge commit (ed62d8a) and I hope it's not horrible. Let me know if anything needs fixing!

@SeanLeRoy (Collaborator) left a review comment


This looks great, thank you for the pull request!

@SeanLeRoy (Collaborator) commented

Merging this PR; I believe the sldy & bfio errors come from something else unrelated. Will be looking into that today :)

@SeanLeRoy SeanLeRoy merged commit 942cc53 into AllenCellModeling:main Oct 18, 2023
95 of 122 checks passed