
Prototype subpackages along processing levels #817

Closed
wants to merge 17 commits

Conversation

leewujung
Member

@leewujung leewujung commented Sep 17, 2022

NOT ready for review/discussion yet! I am still moving things around and will request review when ready.

This PR prototypes subpackages that lie along data processing levels to bring data beyond the calibrated Sv. These include (from low to high processing levels, with some mixing in between):

  • convert (existing)
  • calibrate (existing)
  • filter: remove noise or reduce variability in the data
  • unify
  • mask
  • metrics: summary statistics and NASC

@leewujung leewujung marked this pull request as draft September 17, 2022 21:31
@codecov-commenter

codecov-commenter commented Sep 26, 2022

Codecov Report

Merging #817 (2036d9b) into dev (f92f9a0) will decrease coverage by 25.95%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##              dev     #817       +/-   ##
===========================================
- Coverage   82.20%   56.25%   -25.96%     
===========================================
  Files          52        2       -50     
  Lines        5018       48     -4970     
===========================================
- Hits         4125       27     -4098     
+ Misses        893       21      -872     
Flag Coverage Δ
unittests 56.25% <0.00%> (-25.96%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
echopype/metrics/nasc.py 0.00% <0.00%> (ø)
echopype/consolidate/api.py
echopype/convert/parse_base.py
echopype/convert/utils/ek_raw_parsers.py
echopype/convert/set_groups_base.py
echopype/convert/parse_ek60.py
echopype/echodata/convention/utils.py
echopype/core.py
echopype/convert/utils/ek_date_conversion.py
echopype/convert/__init__.py
... and 42 more


@leewujung
Member Author

@emiliom : a question for you:
Currently we have qc as a subpackage that only contains checking and correcting the monotonicity of the ping_time coordinate. But in this framing PR I put noise removal functions (e.g., thresholding based on estimated background noise, removing pings with spikes) in the filter subpackage along with smoothing functions (e.g., median filter). Do you think all of these should just be in filter? Somehow it feels awkward to think of correcting time coordinates as a type of filtering, since it is more like rectifying things, but maybe it is ok?
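To make the kind of function under discussion concrete, here is a minimal sketch of a moving-median smoother of the sort proposed for the filter subpackage. This is a hypothetical standalone illustration on plain Python lists, not echopype's implementation; the name `median_smooth` and the edge handling (shrinking windows at the boundaries) are assumptions.

```python
import statistics

# Hypothetical sketch: smooth a 1-D series with a centered moving median,
# the classic way to suppress isolated spikes while preserving edges.
def median_smooth(values, window=3):
    """Return a list where each element is the median of a centered window.

    Windows are truncated at the boundaries rather than padded.
    """
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(statistics.median(values[lo:hi]))
    return out
```

Note how a single spike (the 100 in `[1, 100, 1, 1, 1]`) is removed at every interior position, which is why median filtering is a common choice for despiking before thresholding.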

@emiliom
Collaborator

emiliom commented Oct 28, 2022

@leewujung can you add links to references or web pages that describe the calculations? Some of those may be found in https://github.com/open-ocean-sounding/echopy

TODO:

  • Add one-sentence descriptions of every subpackage category
  • Review @leewujung 's preceding comments about qc subpackage

Below I flesh out individual modules or functions as found in this PR, so we can see it in one place. (E) = Existing

@b-reyes
Contributor

b-reyes commented Dec 12, 2022

Based on discussion with @leewujung and @emiliom we have agreed that the mask sub-package functions should only return a mask and not apply the mask. For this reason, we need to create a function that applies masks, say apply_mask. This function will take in a Dataset or DataArray (this is still up for discussion) and a list of masks to apply to the input Dataset or DataArray. Additionally, we need to decide where this function should live (perhaps the mask sub-package is the most natural location).
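The separation agreed above (mask functions return a mask; a separate function applies it) can be sketched as follows. Plain Python lists stand in for the Dataset/DataArray types still under discussion, and the function name `apply_mask` comes from the comment; everything else here is illustrative.

```python
# Hypothetical sketch of the proposed apply_mask: the mask subpackage
# produces boolean masks, and this separate step applies one to the data,
# replacing values at False positions with a fill value.
def apply_mask(data, mask, fill_value=float("nan")):
    """Return data where mask is True; fill_value elsewhere."""
    return [x if keep else fill_value for x, keep in zip(data, mask)]
```

Keeping mask generation and mask application separate means any mask, however produced, can be applied (or inspected, saved, and combined) with the same machinery.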

@leewujung
Member Author

Thanks for the summary @b-reyes. My inclination is:

This function will take in a Dataset or DataArray (this is still up for discussion) and a list of masks to apply to the input Dataset or DataArray.

My inclination is to restrict both to be a Dataset for the following reasons:

  • consistency: all of our other processing functions take in/produce Dataset, so this is easy to remember
  • flexibility: allow users to optionally store other metadata with the mask zarr file without interfering with the functionality
  • xarray to_zarr converts DataArray to Dataset before storing into zarr
  • xarray open_dataset reads zarr into a Dataset and we are familiar with its behavior

Additionally, we need to decide where this function should live (perhaps the mask sub-package is the most natural location).

I agree that under the mask subpackage is the best place.

@emiliom
Collaborator

emiliom commented Dec 14, 2022

Additionally, we need to decide where this function should live (perhaps the mask sub-package is the most natural location).

I, too, agree with apply_mask being in the mask subpackage.

This function will take in a Dataset or DataArray (this is still up for discussion) and a list of masks to apply to the input Dataset or DataArray.

I'm on board with the function taking in an Sv Dataset. But I'd like to go further in the specification:

  • In principle, if more than one mask is passed to apply_mask, these may come in the form of more than one Dataset. The names of the mask variables will most likely not be the same across masks. So, to fully specify a mask for use in apply_mask, the variable's Dataset (or path) and name will be needed.
  • The Sv source will also be specified via a Dataset and the name of the Sv variable. Keep in mind that echopype can generate an Sv variable from either compute_Sv, in which case the variable name is Sv, or from preprocess.remove_noise, in which case the name is Sv_corrected.
  • apply_mask output:
    • apply_mask will return a Dataset, right? Should this be a new Dataset, or is there a use case for also adding the new masked Sv to the source Sv dataset?
    • provide an option to include the mask variable in the output Dataset; if multiple masks were passed, only the single, combined mask should be included.

It may be helpful to look over the estimate_noise and remove_noise functions while working on this. There are many parallels.
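The point above about fully specifying each mask can be sketched like this: since variable names differ across mask datasets, each mask is identified by a (dataset, variable name) pair. Plain dicts stand in for Datasets, and the names `resolve_masks`, `maskA`, and `maskB` are purely illustrative.

```python
# Hypothetical sketch: a mask is specified by its containing dataset plus
# the name of the mask variable within it, since names are not uniform
# across mask datasets.
def resolve_masks(specs):
    """specs: list of (dataset, variable_name) pairs; return the mask arrays."""
    return [dataset[name] for dataset, name in specs]

ds_mask1 = {"maskA": [True, False]}
ds_mask2 = {"maskB": [True, True]}
masks = resolve_masks([(ds_mask1, "maskA"), (ds_mask2, "maskB")])
```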

@b-reyes
Contributor

b-reyes commented Dec 14, 2022

I, too, agree with apply_mask being in the mask subpackage.

Great, it sounds like we all agree on the location of the apply_mask function!

In principle, if more than one mask is passed to apply_mask, these may come in the form of more than one Dataset. The names of the mask variables will most likely not be the same across masks. So, to fully specify a mask for use in apply_mask, the variable's Dataset (or path) and name will be needed.

This is a good point that I had not thought of. Also supplying the name seems like a cumbersome task for the user. Perhaps this function should only take in a DataArray? That way we do not have to worry about this.

The Sv source will also be specified via a Dataset and the name of the Sv variable. Keep in mind that echopype can generate an Sv variable from either compute_Sv, in which case the variable name is Sv, or from preprocess.remove_noise, in which case the name is Sv_corrected.

Another good point and something that is not accounted for in #901. I do not like the idea of the user needing to supply this name, but it seems like it might be needed. @leewujung pinging you as we should decide on this before full review of #901.

apply_mask will return a Dataset, right? Should this be a new Dataset, or is there a use case for also adding the new masked Sv to the source Sv dataset?

I would assume that it would output a new Dataset, so the original source could remain unmodified. However, we could always make this an input flag, say inplace.

provide an option to include the mask variable in the output Dataset; if multiple masks were passed, only the single, combined mask should be included.

I think including the mask in the output Dataset is not preferable. The masks are already an input of the function, so the user could already save them if they want to. Additionally, attaching the mask would increase the data size. However, if we really want to make this an option, we could have a keyword, say include_mask. We could also include all masks individually that were passed and attach them as variables using their provided names (thinking about the size increase, maybe the single combined mask is a better idea).

@leewujung
Member Author

leewujung commented Dec 14, 2022

Thanks @b-reyes @emiliom for the discussions above.

Input format:

  • For whether the input mask should be a DataArray or Dataset or both: my suggestion is that we do not overthink this and implement something straightforward that works as a first pass. The masks we will be applying in this first pass will be generated by echopype and are under our control (currently only the one from freq-diff, which is very simplistic), and we also need to allow having those already saved as zarr somewhere.
  • I am fine with going the DataArray route, and think we should allow only DataArray and path (pathlib and str), and not worry about accommodating Dataset at this moment. Selecting a data variable is easy for the user to do and the input would look like something like [ds_mask1["maskA"], ds_mask2["maskB"], ds_mask3["maskC"]].
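Given a list of mask inputs shaped like `[ds_mask1["maskA"], ds_mask2["maskB"], ds_mask3["maskC"]]`, the masks would be combined into one before application. A minimal sketch, with plain boolean lists standing in for DataArrays; the combination rule (element-wise logical AND, i.e. a point survives only if every mask keeps it) is an assumption, not something settled in this thread.

```python
# Hypothetical sketch: reduce a list of boolean masks to a single combined
# mask by element-wise logical AND.
def combine_masks(masks):
    """Element-wise AND across a list of equal-length boolean masks."""
    return [all(flags) for flags in zip(*masks)]
```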

Remove noise data variable name:

  • these are the functions I have not "touched" yet in the overhaul, but I think Sv_corrected should just be named Sv and attached with proper provenance
  • @emiliom @b-reyes : I have not looked at these functions for a long time, so I would caution on taking the implementation and pattern as is
  • supplying a data variable name in the input of apply_mask to be masked is straightforward to do. I think we should allow that.

Output:

  • I think it is messy to store the masks with the output data and don't think we should provide this as an option.
  • Instead, I think we should have better provenance to document what is taken in and coming out.

I'll ping both of you off github to resolve these discussions.

@b-reyes
Contributor

b-reyes commented Dec 14, 2022

Based on discussions between @leewujung and myself, we have arrived at the following for apply_mask:

  • It will be in the mask sub-package and thus be accessed as echopype.mask.apply_mask(...)
  • It will have the following function inputs:
    • source_ds - a Dataset or a path to a Dataset that contains the variable we want to apply a mask to.
    • var_name - a string corresponding to the variable name in source_ds that the mask should be applied to
    • mask - the actual mask to apply to the variable specified by var_name. This input can be a single mask or a list of masks, each a DataArray or a path to a DataArray
    • fill_value - specifies the value(s) at false indices. This variable can be a scalar, array, Variable, or DataArray (see xr.where for descriptions of these types).
  • The output will be a Dataset with the same format of source_ds with the mask applied to var_name
    • We will not save this output to zarr or netCDF for the user
    • If source_ds is lazy, so will the output Dataset (likewise for in-memory)
  • Other items discussed:
    • We will not store the final mask or any mask in the output Dataset
    • When implementing the method we will use xr.where with x=source_ds[var_name] and y=fill_value
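The agreed `xr.where(mask, x, y)` call, with `x = source_ds[var_name]` and `y = fill_value`, behaves element-wise like the sketch below. Plain lists stand in for xarray objects; the point of interest is that `y` may be either a scalar or an array-like, which is what allows fill_value to be a scalar, array, or DataArray.

```python
# Element-wise mimic of xr.where(cond, x, y): keep x where cond is True,
# substitute y (scalar or sequence) where cond is False.
def where(cond, x, y):
    """Return x[i] where cond[i] is True, else y (or y[i] if y is a list)."""
    ys = y if isinstance(y, list) else [y] * len(x)
    return [xi if c else yi for c, xi, yi in zip(cond, x, ys)]

sv = [10.0, 20.0, 30.0]
mask = [True, False, True]
masked_sv = where(mask, sv, float("nan"))  # NaN at the False index
```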

@b-reyes
Contributor

b-reyes commented Dec 15, 2022

In the above comment I mention that we should let fill_value be an xr.Variable. After further investigation, I think we should not allow this type. The reason is that the documentation states: "However, manipulating data in the form of a Dataset or DataArray should almost always be preferred [in comparison to a xr.Variable], because they can use more complete metadata in context of coordinate labels."

@leewujung
Member Author

I agree with omitting xr.Variable.

@emiliom
Collaborator

emiliom commented Dec 15, 2022

fill_value - specifies the value(s) at false indices. This variable can be a scalar, array, Variable, or DataArray (see xr.where for descriptions of these types).

Just a comment that in the context of xarray (and netcdf/zarr) a "fill value" is typically (or always?) a scalar. It's intended to be literally a single value. I see that the types you've listed come directly from xr.where, which makes sense in the more general context of that function. But in the context of applying a mask, we're talking about a scalar. Or, if you actually intended something different, I suggest you use a name other than fill_value.

@leewujung
Member Author

@emiliom : yes, we did discuss limiting it to only a scalar, but since we are using xr.where under the hood, it seems convenient to also allow a DataArray. In our own use of masks, it will just be NaN by default.

@b-reyes
Contributor

b-reyes commented Dec 16, 2022

In addition to the inputs provided above for apply_mask, we will also need storage options for both the source_ds and mask inputs (as they can be paths). @leewujung and I have settled on storage_options_ds: dict = {} and storage_options_mask: Union[dict, List[dict]] = {} for source_ds and mask, respectively.
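Since `storage_options_mask` can be a single dict or a list of dicts while `mask` can be a list of paths, the two need to be aligned before opening anything. A possible sketch of that normalization; the helper name `normalize_storage_options` and the broadcast-a-single-dict behavior are assumptions, not the settled implementation.

```python
# Hypothetical sketch: align storage_options_mask with the number of mask
# inputs, broadcasting a single dict to every mask path.
def normalize_storage_options(storage_options_mask, n_masks):
    """Return one storage-options dict per mask input."""
    if isinstance(storage_options_mask, dict):
        return [storage_options_mask] * n_masks
    if len(storage_options_mask) != n_masks:
        raise ValueError("need one storage-options dict per mask input")
    return list(storage_options_mask)
```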

@emiliom
Collaborator

emiliom commented Mar 18, 2023

Current state of individual modules or functions, existing and planned, organized by subpackage. The order of subpackages roughly reflects processing order.

TODO:

  • Add one-sentence descriptions of every subpackage category

(E) = Existing

  • convert (E)
    • echodata.update_platform (E). This isn't a "convert" step per se, but it's closely related and it's important from a processing level perspective.
    • Include here combine_echodata capability, conceptually?
  • qc
    • coerce_increasing_time (E)
    • exist_reversed_time (E)
  • calibrate (E)
  • consolidate (E), not in @leewujung's list above but included among the modules edited in this PR
    • add_location (E)
    • add_depth (E)
    • add_splitbeam_angle (E)
    • swap_dims_channel_frequency (E). This is a lower-level function that rearranges data a bit, doesn't create a new product or add new data per se. Move it to qc? (Update: It's not quite qc functionality; it's more about user friendliness)
  • clean: remove noise or reduce variability in the data
    • median
    • conv
    • remove_noise (E)
    • estimate_noise (E)
    • mean_bkg
    • spike
  • commongrid
  • mask
  • metrics
    • summary statistics. See initial (draft?) implementation at metrics/summary_statistics.py
      • delta_z (E)
      • convert_to_linear (E)
      • abundance (E)
      • center_of_mass (E)
      • dispersion (E)
      • evenness (E)
      • aggregation (E)

@leewujung
Member Author

I'll close this PR now since it is very outdated. Some of the functions mentioned here in prototype form have been implemented, and others either won't be implemented in echopype (those related to region and line labels under mask -- they are now under echoregions) or are in active PRs (those related to noise removal under clean).

@leewujung leewujung closed this Feb 25, 2024
@leewujung leewujung deleted the add-subpkg branch July 21, 2024 00:56