
Prototype subpackages along processing levels #817

Closed
wants to merge 17 commits

Conversation

leewujung
Member

@leewujung leewujung commented Sep 17, 2022

NOT ready for review/discussion yet! I am still moving things around and will request review when ready.

This PR prototypes subpackages that lie along data processing levels to bring data beyond the calibrated Sv. These include (from low to high processing levels, with some mixing in between):

  • convert (existing)
  • calibrate (existing)
  • filter: remove noise or reduce variability in the data
  • unify
  • mask
  • metrics: summary statistics and NASC

@leewujung leewujung marked this pull request as draft September 17, 2022 21:31
@codecov-commenter

codecov-commenter commented Sep 26, 2022

Codecov Report

Merging #817 (2036d9b) into dev (f92f9a0) will decrease coverage by 25.95%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##              dev     #817       +/-   ##
===========================================
- Coverage   82.20%   56.25%   -25.96%     
===========================================
  Files          52        2       -50     
  Lines        5018       48     -4970     
===========================================
- Hits         4125       27     -4098     
+ Misses        893       21      -872     
Flag Coverage Δ
unittests 56.25% <0.00%> (-25.96%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
echopype/metrics/nasc.py 0.00% <0.00%> (ø)
echopype/consolidate/api.py
echopype/convert/parse_base.py
echopype/convert/utils/ek_raw_parsers.py
echopype/convert/set_groups_base.py
echopype/convert/parse_ek60.py
echopype/echodata/convention/utils.py
echopype/core.py
echopype/convert/utils/ek_date_conversion.py
echopype/convert/__init__.py
... and 42 more


@leewujung
Member Author

@emiliom : a question for you:
Currently we have qc as a subpackage that only contains checking and correcting the monotonicity of the ping_time coordinate. But in this framing PR I put noise removal functions (e.g., thresholding based on estimated background noise, removing pings with spikes) in the filter subpackage along with smoothing functions (e.g., median filter). Do you think all of these should just be in filter? Somehow it feels awkward to think of correcting time coordinates as a type of filtering, since it is more like rectifying things, but maybe it is ok?
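To make the kind of function under discussion concrete, here is a minimal sketch of a moving-median smoother of the sort proposed for the filter subpackage. This is a hypothetical standalone illustration on plain Python lists, not echopype's implementation; the name `median_smooth` and the edge handling (shrinking windows at the boundaries) are assumptions.

```python
import statistics

# Hypothetical sketch: smooth a 1-D series with a centered moving median,
# the classic way to suppress isolated spikes while preserving edges.
def median_smooth(values, window=3):
    """Return a list where each element is the median of a centered window.

    Windows are truncated at the boundaries rather than padded.
    """
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(statistics.median(values[lo:hi]))
    return out
```

Note how a single spike (the 100 in `[1, 100, 1, 1, 1]`) is removed at every interior position, which is why median filtering is a common choice for despiking before thresholding.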

@emiliom
Collaborator

emiliom commented Oct 28, 2022

@leewujung can you add links to references or web pages that describe the calculations? Some of those may be found in https://github.com/open-ocean-sounding/echopy

TODO:

  • Add one-sentence descriptions of every subpackage category
  • Review @leewujung 's preceding comments about qc subpackage

Below I flesh out individual modules or functions as found in this PR, so we can see it in one place. (E) = Existing

@b-reyes
Contributor

b-reyes commented Dec 12, 2022

Based on discussion with @leewujung and @emiliom we have agreed that the mask sub-package functions should only return a mask and not apply the mask. For this reason, we need to create a function that applies masks, say apply_mask. This function will take in a Dataset or DataArray (this is still up for discussion) and a list of masks to apply to the input Dataset or DataArray. Additionally, we need to decide where this function should live (perhaps the mask sub-package is the most natural location).
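The separation agreed above (mask functions return a mask; a separate function applies it) can be sketched as follows. Plain Python lists stand in for the Dataset/DataArray types still under discussion, and the function name `apply_mask` comes from the comment; everything else here is illustrative.

```python
# Hypothetical sketch of the proposed apply_mask: the mask subpackage
# produces boolean masks, and this separate step applies one to the data,
# replacing values at False positions with a fill value.
def apply_mask(data, mask, fill_value=float("nan")):
    """Return data where mask is True; fill_value elsewhere."""
    return [x if keep else fill_value for x, keep in zip(data, mask)]
```

Keeping mask generation and mask application separate means any mask, however produced, can be applied (or inspected, saved, and combined) with the same machinery.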

@leewujung
Member Author

Thanks for the summary @b-reyes. My inclination is:

This function will take in a Dataset or DataArray (this is still up for discussion) and a list of masks to apply to the input Dataset or DataArray.

My inclination is to restrict both to be a Dataset for the following reasons:

  • consistency: all of our other processing functions take in/produce Dataset, so this is easy to remember
  • flexibility: allow users to optionally store other metadata with the mask zarr file without interfering with the functionality
  • xarray to_zarr converts DataArray to Dataset before storing into zarr
  • xarray open_dataset reads zarr into a Dataset and we are familiar with its behavior

Additionally, we need to decide where this function should live (perhaps the mask sub-package is the most natural location).

I agree that under the mask subpackage is the best place.

@emiliom
Collaborator

emiliom commented Dec 14, 2022

Additionally, we need to decide where this function should live (perhaps the mask sub-package is the most natural location).

I, too, agree with apply_mask being in the mask subpackage.

This function will take in a Dataset or DataArray (this is still up for discussion) and a list of masks to apply to the input Dataset or DataArray.

I'm on board with the function taking in an Sv Dataset. But I'd like to go further in the specification:

  • In principle, if more than one mask is passed to apply_mask, these may come in the form of more than one Dataset. The names of the mask variables will most likely not be the same across masks. So, to fully specify a mask for use in apply_mask, the variable's Dataset (or path) and name will be needed.
  • The Sv source will also be specified via a Dataset and the name of the Sv variable. Keep in mind that echopype can generate an Sv variable from either compute_Sv, in which case the variable name is Sv, or from preprocess.remove_noise, in which case the name is Sv_corrected.
  • apply_mask output:
    • apply_mask will return a Dataset, right? Should this be a new Dataset, or is there a use case for also adding the new masked Sv to the source Sv dataset?
    • provide an option to include the mask variable in the output Dataset; if multiple masks were passed, only the single, combined mask should be included.

It may be helpful to look over the estimate_noise and remove_noise functions while working on this. There are many parallels.
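The point above about fully specifying each mask can be sketched like this: since variable names differ across mask datasets, each mask is identified by a (dataset, variable name) pair. Plain dicts stand in for Datasets, and the names `resolve_masks`, `maskA`, and `maskB` are purely illustrative.

```python
# Hypothetical sketch: a mask is specified by its containing dataset plus
# the name of the mask variable within it, since names are not uniform
# across mask datasets.
def resolve_masks(specs):
    """specs: list of (dataset, variable_name) pairs; return the mask arrays."""
    return [dataset[name] for dataset, name in specs]

ds_mask1 = {"maskA": [True, False]}
ds_mask2 = {"maskB": [True, True]}
masks = resolve_masks([(ds_mask1, "maskA"), (ds_mask2, "maskB")])
```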

@b-reyes
Contributor

b-reyes commented Dec 14, 2022

I, too, agree with apply_mask being in the mask subpackage.

Great, it sounds like we all agree on the location of the apply_mask function!

In principle, if more than one mask is passed to apply_mask, these may come in the form of more than one Dataset. The names of the mask variables will most likely not be the same across masks. So, to fully specify a mask for use in apply_mask, the variable's Dataset (or path) and name will be needed.

This is a good point that I had not thought of. Also supplying the name seems like a cumbersome task for the user. Perhaps this function should only take in a DataArray? That way we do not have to worry about this.

The Sv source will also be specified via a Dataset and the name of the Sv variable. Keep in mind that echopype can generate an Sv variable from either compute_Sv, in which case the variable name is Sv, or from preprocess.remove_noise, in which case the name is Sv_corrected.

Another good point and something that is not accounted for in #901. I do not like the idea of the user needing to supply this name, but it seems like it might be needed. @leewujung pinging you as we should decide on this before full review of #901.

apply_mask will return a Dataset, right? Should this be a new Dataset, or is there a use case for also adding the new masked Sv to the source Sv dataset?

I would assume that it would output a new Dataset, so the original source could remain unmodified. However, we could always make this an input flag, say inplace.

provide an option to include the mask variable in the output Dataset; if multiple masks were passed, only the single, combined mask should be included.

I think including the mask in the output Dataset is not preferable. The masks are already an input of the function, so the user could already save them if they want to. Additionally, attaching the mask would increase the data size. However, if we really want to make this an option, we could have a keyword, say include_mask. We could also include all masks individually that were passed and attach them as variables using their provided names (thinking about the size increase, maybe the single combined mask is a better idea).

@leewujung
Member Author

leewujung commented Dec 14, 2022

Thanks @b-reyes @emiliom for the discussions above.

Input format:

  • For whether the input mask should be a DataArray or Dataset or both: my suggestion is that we do not overthink this and implement something straightforward that works as a first pass. The masks we will be applying in this first pass will be generated by echopype and are under our control (currently only the one from freq-diff, which is very simplistic), and we also need to allow having those already saved as zarr somewhere.
  • I am fine with going the DataArray route, and think we should allow only DataArray and path (pathlib and str), and not worry about accommodating Dataset at this moment. Selecting a data variable is easy for the user to do and the input would look like something like [ds_mask1["maskA"], ds_mask2["maskB"], ds_mask3["maskC"]].
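Given a list of mask inputs shaped like `[ds_mask1["maskA"], ds_mask2["maskB"], ds_mask3["maskC"]]`, the masks would be combined into one before application. A minimal sketch, with plain boolean lists standing in for DataArrays; the combination rule (element-wise logical AND, i.e. a point survives only if every mask keeps it) is an assumption, not something settled in this thread.

```python
# Hypothetical sketch: reduce a list of boolean masks to a single combined
# mask by element-wise logical AND.
def combine_masks(masks):
    """Element-wise AND across a list of equal-length boolean masks."""
    return [all(flags) for flags in zip(*masks)]
```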

Remove noise data variable name:

  • these are the functions I have not "touched" yet in the overhaul, but I think Sv_corrected should just be named Sv and attached with proper provenance
  • @emiliom @b-reyes : I have not looked at these functions for a long time, so I would caution on taking the implementation and pattern as is
  • supplying a data variable name in the input of apply_mask to be masked is straightforward to do. I think we should allow that.

Output:

  • I think it is messy to store the masks with the output data and don't think we should provide this as an option.
  • Instead, I think we should have better provenance to document what is taken in and coming out.

I'll ping both of you off github to resolve these discussions.

@b-reyes
Contributor

b-reyes commented Dec 14, 2022

Based on discussions between @leewujung and myself, we have arrived at the following for apply_mask:

  • It will be in the mask sub-package and thus be accessed as echopype.mask.apply_mask(...)
  • It will have the following function inputs:
    • source_ds - a Dataset or a path to a Dataset that contains the variable we want to apply a mask to.
    • var_name - a string corresponding to the variable name in source_ds that the mask should be applied to
    • mask - the actual mask to apply to the variable specified by var_name. This input can be a single mask or a list of masks, each a DataArray or a path to a DataArray
    • fill_value - specifies the value(s) at false indices. This variable can be a scalar, array, Variable, or DataArray (see xr.where for descriptions of these types).
  • The output will be a Dataset with the same format of source_ds with the mask applied to var_name
    • We will not save this output to zarr or netCDF for the user
    • If source_ds is lazy, so will the output Dataset (likewise for in-memory)
  • Other items discussed:
    • We will not store the final mask or any mask in the output Dataset
    • When implementing the method we will use xr.where with x=source_ds[var_name] and y=fill_value
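The agreed `xr.where(mask, x, y)` call, with `x = source_ds[var_name]` and `y = fill_value`, behaves element-wise like the sketch below. Plain lists stand in for xarray objects; the point of interest is that `y` may be either a scalar or an array-like, which is what allows fill_value to be a scalar, array, or DataArray.

```python
# Element-wise mimic of xr.where(cond, x, y): keep x where cond is True,
# substitute y (scalar or sequence) where cond is False.
def where(cond, x, y):
    """Return x[i] where cond[i] is True, else y (or y[i] if y is a list)."""
    ys = y if isinstance(y, list) else [y] * len(x)
    return [xi if c else yi for c, xi, yi in zip(cond, x, ys)]

sv = [10.0, 20.0, 30.0]
mask = [True, False, True]
masked_sv = where(mask, sv, float("nan"))  # NaN at the False index
```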

@b-reyes
Contributor

b-reyes commented Dec 15, 2022

In the above comment I mention that we should let fill_value be an xr.Variable. After further investigation, I think we should not allow this type. The reason is that the documentation states: "However, manipulating data in the form of a Dataset or DataArray should almost always be preferred [in comparison to a xr.Variable], because they can use more complete metadata in context of coordinate labels."

@leewujung
Member Author

I agree with omitting xr.Variable.

@emiliom
Collaborator

emiliom commented Dec 15, 2022

fill_value - specifies the value(s) at false indices. This variable can be a scalar, array, Variable, or DataArray (see xr.where for descriptions of these types).

Just a comment that in the context of xarray (and netcdf/zarr) a "fill value" is typically (or always?) a scalar. It's intended to be literally a single value. I see that the types you've listed come directly from xr.where, which makes sense in the more general context of that function. But in the context of applying a mask, we're talking about a scalar. Or, if you actually intended something different, I suggest you use a name other than fill_value.

@leewujung
Member Author

@emiliom : yes, we did discuss limiting it to only a scalar, but since we are using xr.where under the hood, it seems convenient to also allow a DataArray. In our own use of masks, it will just be NaN by default.

@b-reyes
Contributor

b-reyes commented Dec 16, 2022

In addition to the inputs provided above for apply_mask, we will also need storage options for both the source_ds and mask inputs (as they can be paths). @leewujung and I have settled on storage_options_ds: dict = {} and storage_options_mask: Union[dict, List[dict]] = {} for source_ds and mask, respectively.
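Since `storage_options_mask` can be a single dict or a list of dicts while `mask` can be a list of paths, the two need to be aligned before opening anything. A possible sketch of that normalization; the helper name `normalize_storage_options` and the broadcast-a-single-dict behavior are assumptions, not the settled implementation.

```python
# Hypothetical sketch: align storage_options_mask with the number of mask
# inputs, broadcasting a single dict to every mask path.
def normalize_storage_options(storage_options_mask, n_masks):
    """Return one storage-options dict per mask input."""
    if isinstance(storage_options_mask, dict):
        return [storage_options_mask] * n_masks
    if len(storage_options_mask) != n_masks:
        raise ValueError("need one storage-options dict per mask input")
    return list(storage_options_mask)
```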

@emiliom
Collaborator

emiliom commented Mar 18, 2023

Current state of individual modules or functions, existing and planned, organized by subpackage. The order of subpackages roughly reflects processing order.

TODO:

  • Add one-sentence descriptions of every subpackage category

(E) = Existing

  • convert (E)
    • echodata.update_platform (E). This isn't a "convert" step per se, but it's closely related and it's important from a processing level perspective.
    • Include here combine_echodata capability, conceptually?
  • qc
    • coerce_increasing_time (E)
    • exist_reversed_time (E)
  • calibrate (E)
  • consolidate (E), not in @leewujung's list above but included among the modules edited in this PR
    • add_location (E)
    • add_depth (E)
    • add_splitbeam_angle (E)
    • swap_dims_channel_frequency (E). This is a lower-level function that rearranges data a bit, doesn't create a new product or add new data per se. Move it to qc? (Update: It's not quite qc functionality; it's more about user friendliness)
  • clean: remove noise or reduce variability in the data
    • median
    • conv
    • remove_noise (E)
    • estimate_noise (E)
    • mean_bkg
    • spike
  • commongrid
  • mask
  • metrics
    • summary statistics. See initial (draft?) implementation at metrics/summary_statistics.py
      • delta_z (E)
      • convert_to_linear (E)
      • abundance (E)
      • center_of_mass (E)
      • dispersion (E)
      • evenness (E)
      • aggregation (E)

@leewujung
Member Author

I'll close this PR now since it is very outdated. Some of the functions mentioned here in prototype form have been implemented, and others either won't be implemented in echopype (those related to region and line labels under mask -- they are now under echoregions) or are in active PRs (those related to noise removal under clean).

@leewujung leewujung closed this Feb 25, 2024
@leewujung leewujung deleted the add-subpkg branch July 21, 2024 00:56