Back-testing previous survey years to ensure compatibility #307

Open
5 of 15 tasks
brandynlucca opened this issue Dec 4, 2024 · 1 comment

Comments

@brandynlucca
Collaborator

brandynlucca commented Dec 4, 2024

The Echopop codebase was built using the 2019 data. There are known issues that will require changes for older survey years, but some unforeseen changes may also be necessary for more recent survey years. This is a running list of survey years that have been confirmed to work when processed within Echopop:

  • 2023
  • 2021
  • 2019
  • 2017
  • 2015
  • 2013
  • 2012
  • 2011
  • 2009
  • 2007
  • 2005
  • 2003
  • 2001
  • 1998
  • 1995
@brandynlucca
Collaborator Author

brandynlucca commented Dec 6, 2024

Changelog for each year

🕵️ = active investigation
⚠️ = issue that occurs on the user's end that must be resolved prior to being ingested by Echopop
Last edit: 13 December 2024 14:28 PT

2023 ✅

  • Several strings required for NAME_CONFIG [f8667b]:
    cluser name, frequency, species_code, strata_index, strata index, weight_in_haul
  • Updated how lowercase and uppercase column names are handled because of inconsistent usage; implemented in "Minor bug fixes" #308
  • Overlapping stratum assignments break the code throughout the entire workflow and will require significant fixes to enable things to run. Echopop now expects to read in the INPFC dataset from both the strata and geostrata *.xlsx sheets defined in the fileset configuration *.yaml files [7bd02c0]
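As a rough illustration of the case-handling item above, here is a minimal sketch of case-insensitive column renaming. The alias table and the canonical names in it are hypothetical and are not Echopop's actual NAME_CONFIG entries; the helper name `normalize_columns` is also made up for this example.

```python
import pandas as pd

# Hypothetical alias table: lowercased source names -> canonical names.
# Illustrative only; these are NOT Echopop's real NAME_CONFIG entries.
COLUMN_ALIASES = {
    "species_code": "species_id",
    "strata index": "stratum_num",
    "strata_index": "stratum_num",
    "weight_in_haul": "haul_weight",
}


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns by matching their stripped, lowercased names against COLUMN_ALIASES."""
    renamed = {}
    for col in df.columns:
        key = str(col).strip().lower()
        # Fall back to the lowercased name when no alias is defined
        renamed[col] = COLUMN_ALIASES.get(key, key)
    return df.rename(columns=renamed)


raw = pd.DataFrame(
    {"Species_Code": [22500], "STRATA INDEX": [1], "Weight_In_Haul": [12.5]}
)
clean = normalize_columns(raw)
```

Lowercasing before the lookup means a single alias entry covers every capitalization variant found in the source files.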

2021 ✅

  • A re-indexing change [bb95f1b] was required for distributing stratified abundance and biomass estimates over the weight proportions during the apportionment step in Survey.kriging_analysis
  • When stratum="inpfc" and exclude_age1=True, running the nasc_to_biomass() function within Survey.transect_analysis() raises the following warning: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation. Along with the changes in [7bd02c0], updates to how geographic binning is handled correct strata values that had been replaced with NaN [6c0acd8]
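The pandas warning quoted above suggests excluding empty or all-NA entries before concatenating. A minimal sketch of that approach follows; it is not necessarily how [6c0acd8] resolved it, and the helper name `concat_non_empty` and the toy columns are assumptions for illustration.

```python
import pandas as pd


def concat_non_empty(frames):
    """Concatenate only the frames that have rows and at least one non-NA value."""
    kept = [f for f in frames if not f.empty and not f.isna().all().all()]
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)


# Toy inputs: one populated frame and one empty frame with the same columns
a = pd.DataFrame({"stratum": [1, 2], "biomass": [10.0, 20.0]})
b = pd.DataFrame({"stratum": [], "biomass": []})
result = concat_non_empty([a, b])
```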

2017 ✅

  • Several strings required for NAME_CONFIG [0e512e0]:
    cluster number, haul start, inpfc

2015 ❌

  • $\textcolor{orange}{\textsf{When defining the stratification files used in 'EchoPro' for this dataset, the "US\&CAN strata 2015 shoreside.xlsx" file has}}$ $\textcolor{orange}{\textsf{missing column names. This renders 'Echopop' inoperable and correctly raises a validation error when trying to ingest}}$ $\textcolor{orange}{\textsf{this file}}$ ⚠️
  • Cells in certain files (e.g. 2015_biodata_catch.xlsx) are not truly empty but contain whitespace that is parsed as strings like " ". This causes the pandera validators to (correctly) flag these as invalid entries since the values cannot be coerced into their expected datatypes (e.g. float, int) [3db6070]
  • The pandera validator has been updated to correctly parse column names and explicitly searches for exact matches when regex=False for annotations [96ec6c]
  • Acoustic Echoview interval *.csv files have multiple longitude/latitude columns (i.e. Lon_S/Lon_E and Lat_S/Lat_E), which causes issues during the validation step because only a single longitude/latitude column is expected. This has been fixed by narrowing the allowable longitude/latitude column names (i.e. Lon_S and Lat_S); however, this may not work if other column suffixes (i.e. _M and _E) are used in the future [b62b9c8]
  • The presence of non-*.csv files among the raw files raises an error because the file in question (V160-S201507-X2-F38-T49-Z0- (cells)_original.csv_bak) is still identified as having the correct file extension when it does not. This has been rectified by changing how *.csv files are parsed from the defined directory [d4141f6]
  • Transect numbers cross multiple INPFC strata, which yields issues when indexing the transect data for the random stratified analysis [8de8bd3]
  • An Error is raised when stratum_inpfc=7 due to there only being a single transect present within the stratum [8de8bd3]
  • A deprecation warning (FutureWarning) is raised by pandas when running Survey.stratified_analysis(dataset="kriging"): FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. [e05b1fc]
  • Certain Echoview *.csv file exports fail to be ingested and are correctly flagged by the pandera validator for missing columns (namely those corresponding to vl_start and vl_end, e.g. V160-S201507-X2-F38-T40-Z0- (intervals).csv). These files instead contain Dist_S and Dist_E, which do not match the expected columns. Transect interval distances computed from these columns appear to roughly match those from the vessel log distances (~0.5 nmi), so this may simply be a case of updating the validator to look for column combinations of either vl_start/vl_end or dist_s/dist_e 🕵️
  • Some files (e.g. US&CAN strata 2015 shoreside.xlsx) have erroneous spaces interspersed in or appended to particular column names, which raises major issues for appropriately validating column names and data. This can be hard-coded in the translation dictionary, but it will need to be handled on the user side in the future [4ec9524]
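Several of the 2015 items above come down to tidying inputs before validation. The helpers below are a sketch under stated assumptions, not the code from the referenced commits: all four function names are hypothetical, collapsing interspersed spaces to a single space is an assumed cleanup rule, and the vl_start/vl_end versus dist_s/dist_e fallback is only the option floated in the item still under investigation.

```python
import re
from pathlib import Path

import numpy as np
import pandas as pd


def blank_to_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Replace cells that are empty or whitespace-only strings with NaN."""
    return df.replace(r"^\s*$", np.nan, regex=True)


def tidy_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse runs of whitespace in column names and strip both ends."""
    return df.rename(columns=lambda c: re.sub(r"\s+", " ", str(c)).strip())


def list_csv_files(directory: Path) -> list[Path]:
    """Return files whose suffix is exactly '.csv', so '.csv_bak' is skipped."""
    return sorted(
        p for p in directory.iterdir() if p.is_file() and p.suffix.lower() == ".csv"
    )


def pick_distance_columns(columns) -> tuple[str, str]:
    """Prefer vl_start/vl_end, falling back to dist_s/dist_e (case-insensitive)."""
    lookup = {str(c).lower(): c for c in columns}
    for start, end in (("vl_start", "vl_end"), ("dist_s", "dist_e")):
        if start in lookup and end in lookup:
            return lookup[start], lookup[end]
    raise KeyError("No recognized start/end distance column pair found")


# Toy example: a padded column name and a whitespace-only cell
catch = pd.DataFrame({" haul  weight ": ["12.5", " ", "3.0"]})
catch = tidy_column_names(blank_to_nan(catch))
catch["haul weight"] = pd.to_numeric(catch["haul weight"])
```

Comparing `Path.suffix` rather than searching the filename for ".csv" is what keeps a file ending in `.csv_bak` out of the list.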

2013 ✅

  • Inconsistent strata presence/indices cause some issues when running Survey.transect_analysis [1470d8b]
  • Acoustic data loading and consolidation produces a transect dataset where all NASC = 0.0 with transect regions not being correctly parsed. This is due to the expected Region_name formatting not being present in the Echoview export *.csv cell files. This therefore will require a different approach in how the correct region classes are mapped to the associated region names [f373899]
  • An issue with the variogram/kriging parameterization file breaks both Survey.fit_variogram and Survey.kriging_analysis. This results from an odd mismatch in the computed stratum-specific values that yields NaN values when merged with the transect data. The root cause appears to be a change in how transect intervals where NASC = 0.0 are assigned KS strata values. Transect data from 2015 and later automatically assigned empty cells with stratum_num=1; however, these values are now assigned stratum_num=0 for transect data from 2013 and earlier. This was addressed similarly to the stratification issue highlighted for the 2021 survey year [6c0acd8]
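To show how a stratum_num=0 placeholder can turn into NaN values after a merge, here is a toy sketch that uses the pandas `indicator` option to flag unmatched strata. The data are invented, the `sigma_bs_mean` column name is hypothetical, and this is a diagnostic only, not the fix applied in [6c0acd8].

```python
import pandas as pd

# Toy transect intervals; stratum_num=0 stands in for an interval with NASC = 0.0
transects = pd.DataFrame(
    {
        "transect_num": [1, 1, 2],
        "stratum_num": [0, 2, 3],
        "nasc": [0.0, 150.0, 80.0],
    }
)

# Toy stratum-specific values; there is no row for stratum 0
strata_values = pd.DataFrame(
    {"stratum_num": [1, 2, 3], "sigma_bs_mean": [1.1e-5, 1.2e-5, 1.3e-5]}
)

# A left merge keeps every interval; indicator=True adds a "_merge" column
merged = transects.merge(strata_values, on="stratum_num", how="left", indicator=True)

# Strata present in the transect data but missing from the strata table
unmatched = merged.loc[merged["_merge"] == "left_only", "stratum_num"].unique()
```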

2012 ✅

  • Acoustic data loading and consolidation produces a transect dataset where all NASC = 0.0 with transect regions not being correctly parsed. This is due to the expected Region_name formatting not being present in the Echoview export *.csv cell files. This therefore will require a different approach in how the correct region classes are mapped to the associated region names [f373899]
  • $\textcolor{orange}{\textsf{The transect-region-haul mapping file required for ingesting the Echoview exports called by 'EchoPro' is a completely}}$ $\textcolor{orange}{\textsf{different format than in subsequent years (i.e. 2013 and beyond). When using the individual files like in later years, the}}$ $\textcolor{orange}{\textsf{ age-1+ files contain trawl information; however, the age-2+ file has a completely empty column for the trawl numbers.}}$ $\textcolor{orange}{\textsf{ For testing purposes, this has been filled with the values from the age-1+ file since the region names, etc., are identical.}}$ $\textcolor{orange}{\textsf{ However, this will correctly raise an Error when attempted otherwise}}$ ⚠️
  • Some messy aspects in the acoustic data yield virtual transects with non-functional distances that have mixed transect-strata indices when running Survey.stratified_analysis(dataset="kriging") [3dec490]

2011 ❌

  • Sheets like Stratification_geographic_Lat_rev have ambiguous column names (i.e. "Latitude") that can be found in other files. However, Echopop expects the column to be Latitude (upper limit) or northlimit_latitude. Consequently, the pandera validator will correctly raise an Error when this file is read in 🕵️
  • The sheet names for these stratification files are also inconsistent with later surveys. Instead of names like "Base KS"/"stratification1" and "INPFC", the sheets have names like "stratification #0 (INPFC)" and "stratification #1". This mostly causes issues for the INPFC sheet, whose name was expected to be either "INPFC" or "inpfc". Now any sheet with "INPFC" in the name will be read in; however, the name must contain the "inpfc" string [8a33eb8]
  • Overlapping INPFC latitude limits must monotonically increase [c8fa839]
  • A completely different haul-transect-region key assignment that is inconsistent with later years 🕵️
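The sheet-name and latitude-limit items above can be sketched as two small checks. Both helper names are hypothetical, and the case-insensitive match is one reading of the "inpfc" requirement rather than a confirmed description of [8a33eb8] or [c8fa839].

```python
import pandas as pd


def find_inpfc_sheet(sheet_names):
    """Return the first sheet name that contains 'inpfc', ignoring case."""
    for name in sheet_names:
        if "inpfc" in name.lower():
            return name
    raise ValueError("No sheet name contains 'inpfc'")


def limits_increase(limits: pd.Series) -> bool:
    """True when the latitude limits never decrease from one row to the next."""
    return bool(limits.is_monotonic_increasing)


sheet = find_inpfc_sheet(["stratification #1", "stratification #0 (INPFC)"])
ok = limits_increase(pd.Series([36.0, 40.5, 43.0]))
bad = limits_increase(pd.Series([36.0, 43.0, 40.5]))
```

In practice the list of names could come from `pd.ExcelFile(path).sheet_names`. Note that `is_monotonic_increasing` allows repeated values, so a stricter check would also need to reject ties.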

2009 🔜

2007 🔜

2005 🔜

2003 🔜

2001 🔜

1998 🔜

1995 🔜

Projects
Status: In progress