LiveSurvey Class #264
base: main
Conversation
# Database root directory
database_root_directory = file_configuration["database_directory"]

# Initialize the database file
initialize_database(database_root_directory, file_settings)

# Drop incomplete datasets
if dataset == "biology":
    data_files = validate_complete_biology_dataset(
        data_files, directory_path, file_configuration
    )

# Query the SQL database to process only new files (or create the db file in the first place)
valid_files, file_configuration["database"][dataset] = query_processed_files(
    database_root_directory, file_settings, data_files
)
This section deviates from the goal of validating paths, as suggested by the function name. It would probably be better to separate these out and keep only the necessary content in load_acoustic_data and load_biological_data. For example, the section below is only needed in load_biological_data:

# Drop incomplete datasets
if dataset == "biology":
    data_files = validate_complete_biology_dataset(
        data_files, directory_path, file_configuration
    )
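A hypothetical sketch of the suggested split (names taken from the diff above; directory_path is assumed to be available in scope): path validation stays in validate_data_directory, while the biology-only completeness check moves into the loader that needs it.

def load_biological_data(self, input_filenames=None):
    # Validate the data directory and format the filepaths
    data_files = eldl.validate_data_directory(
        self.config, dataset="biology", input_filenames=input_filenames
    )
    # Biology-specific step, no longer hidden inside the path validator
    return validate_complete_biology_dataset(
        data_files, directory_path, self.config
    )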
# Validate the data directory and format the filepaths
acoustic_files = eldl.validate_data_directory(
    self.config, dataset="acoustics", input_filenames=input_filenames
)
Note to self: only new files are in the returned acoustic_files, because validate_data_directory right now contains query_processed_files, which excludes files that have already been processed.
# ---- Add the `acoustic_data_units` to the dictionary
self.config["acoustics"]["dataset_units"] = acoustic_data_units
# ---- Preprocess the acoustic dataset
# TODO: SettingWithCopyWarning:
self.input["acoustics"]["prc_nasc_df"] = preprocess_acoustic_data(
    prc_nasc_df.copy(), self.input["spatial"], self.config
)
# ---- Add meta key
self.meta["provenance"].update(
    {
        "acoustic_files_read": acoustic_files,
    }
)
These additions are ad hoc and deviate from the idea behind a validated config model.
return biology_output


def read_acoustic_zarr(file: Path, config_map: dict, xarray_kwargs: dict = {}) -> tuple:
Suggest changing this to nasc_zarr_to_df so that it is clear what it does. That would also avoid the confusing similarity with read_acoustics_files, which has an "s" in "acoustics" while this function name does not...
"required_keys": ["frequency", "units"], | ||
"optional_keys": [], | ||
"keys": { | ||
"frequency": float, | ||
"units": ["Hz", "kHz"], | ||
}, |
I feel this is making the options too liberal...
# ---- Filter out any unused frequency coordinates
prc_nasc_df_filtered = (
    survey_data[survey_data["frequency_nominal"] == transmit_settings["frequency"]]
    # ---- Drop NaN/NaT values from longitude/latitude/ping_time
    .dropna(subset=["longitude", "latitude", "ping_time"])
)
I think it will be more efficient to do this frequency selection in read_acoustic_files, so that what you're doing in this function (preprocess_acoustic_data) is strictly related to spatial operations (and then the function name can also be made more specific). The .drop(columns=["frequency_nominal"]) at the end would then no longer be needed.
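A minimal sketch of the suggested reorganization, with a toy frame standing in for the real survey_data: selecting the frequency up front means the spatial preprocessing never sees frequency_nominal at all.

import pandas as pd

# Toy stand-ins for survey_data and the configured transmit frequency
survey_data = pd.DataFrame(
    {"frequency_nominal": [38000.0, 120000.0], "nasc": [10.0, 12.0]}
)
transmit_settings = {"frequency": 38000.0}
# Filter inside read_acoustic_files and drop the column immediately
filtered = survey_data[
    survey_data["frequency_nominal"] == transmit_settings["frequency"]
].drop(columns=["frequency_nominal"])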
)


def apply_griddify_definitions(dataset: pd.DataFrame, spatial_config: dict):
I think the first half of apply_griddify_definitions (up to # Convert to GeoDataFrame, where you actually handle the acoustic data) could be separated out as a function that you run once in the init, storing the output grids needed to "assign" the NASC data into.
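A hypothetical sketch of that split, assuming shapely/geopandas and placeholder bounds rather than the real spatial_config contents: the grid is built once and cached, so apply_griddify_definitions only has to join NASC rows onto the cells.

import numpy as np
import geopandas as gpd
from shapely.geometry import box

def build_population_grid(xmin, ymin, xmax, ymax, step):
    # Construct square grid cells once (e.g. in the init) and reuse them
    cells = [
        box(x, y, x + step, y + step)
        for x in np.arange(xmin, xmax, step)
        for y in np.arange(ymin, ymax, step)
    ]
    return gpd.GeoDataFrame({"cell_id": range(len(cells))}, geometry=cells)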
# # ---- Create filepath object
if "data_root_dir" in file_configuration:
    # directory_path = Path(file_configuration["data_root_dir"]) / file_settings["directory"]
    directory_path = "/".join([file_configuration["data_root_dir"], file_settings["directory"]])
else:
    directory_path = file_settings["directory"]
I am curious what the problem with using a Path object is. Path("") / file_settings["directory"] would still give Path(file_settings["directory"]), so it seems the if-else can be eliminated.
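A minimal sketch of that simplification (the dict contents here are placeholders): pathlib normalizes away empty components, so the two branches collapse into one expression.

from pathlib import Path

file_configuration = {"data_root_dir": ""}  # may or may not contain the key
file_settings = {"directory": "biology"}
# A missing or empty root directory falls through harmlessly
directory_path = Path(file_configuration.get("data_root_dir", "")) / file_settings["directory"]
assert directory_path == Path("biology")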
    if isinstance(df, pd.DataFrame) and not df.empty
}
# ---- Create new data flag
file_configuration["length_distribution"] = prepare_length_distribution(file_configuration)
This addition of "length_distribution" to self.config is probably not a good idea because you already have self.config["biology"]["length_distribution"]. Since the config dict is defined upfront, I would suggest against adding undefined keys to it in other parts of the code. I think I made another comment on this as well.
# ---- Incorporate additional data, if new data are present
if filtered_biology_output:
    # ---- Merge the trawl information and app
    merge_trawl_info(filtered_biology_output)
Why do you delete biology_dict["trawl_info_df"] at the end of this function?
return length_bins_df


def preprocess_biology_data(biology_output: dict, spatial_dict: dict, file_configuration: dict):
I need your help to go through the intention of this function, because it is quite hidden behind the many layers of very specific functions. I think there is a balance between reusability and readability that we should discuss. I also have an additional question on why we need a database if, for new trawls, we are basically just appending (accumulating) new data, especially since at the end you insert the new data and then pull out the combined data. Wouldn't that be the same as reading a csv into a dataframe, adding the new data to it, and then saving this new dataframe containing everything under the same csv filename?
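A minimal sketch of the csv-accumulation alternative described above (the filename and new_data frame are hypothetical stand-ins for the incoming trawl data):

import pandas as pd
from pathlib import Path

csv_path = Path("biology_accumulated.csv")
new_data = pd.DataFrame({"haul": [101], "weight_kg": [3.2]})
# Read what exists, append the new rows, and write everything back
frames = [pd.read_csv(csv_path)] if csv_path.exists() else []
combined = pd.concat(frames + [new_data], ignore_index=True)
combined.to_csv(csv_path, index=False)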
return df_validated


def infer_datetime_format(timestamp_str: Union[int, str]):
I wonder if the patterns you included here are already covered by pandas' guess_datetime_format, since these seem to be pretty standard ones. Also, do you need these because the files from FEAT have different datetime formats? For now this obviously works, but I feel we can communicate with Alicia to see if she could make those uniform.
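For reference, a quick check of that pandas helper (public under pandas.tseries.api as of pandas 2.0); whether it covers every FEAT format is an open question, but standard timestamps resolve cleanly:

from pandas.tseries.api import guess_datetime_format

fmt = guess_datetime_format("2024-07-15 12:30:00")
print(fmt)  # expected: "%Y-%m-%d %H:%M:%S"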
def filter_filenames(
    directory_path: Path, filename_id: str, files: List[Path], file_extension: str
):
I think we can use fsspec for flexible handling between different file systems. Also, it is a lot of work here to get something general across all the files, yet the patterns are still pretty specific. For the scenario we are dealing with, I'd say generality is not the first priority, and more hard-coding may actually make the code more readable. Let's discuss this when we meet.
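A minimal fsspec sketch (the path is hypothetical): the same glob call works whether the underlying filesystem is local, s3, etc., so filter_filenames would not need per-filesystem branches.

import fsspec

fs = fsspec.filesystem("file")  # swap "file" for "s3", "gcs", ...
csv_files = fs.glob("data/biology/*.csv")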
# Add population-specific columns (specified in the file configuration)
# TODO: Add to `yaml` file for configuration; hard-code for now
add_columns = ["number_density", "biomass_density"]
# ----
df[add_columns] = 0.0
Just trying to make sure that I understand - these two columns are not supposed to be added here, and they are here as a temporary hardcoded fix?
strata_values = np.unique(nasc_biology_data["stratum"]).tolist()

# Update the table
sql_update_strata_summary(
This is called in both biology_pipeline and acoustic_pipeline. I am not sure why this update from acoustic_db to biology_db needs to happen. Is it for the sake of keeping a summary in the biology_db?
Interim summary from me reading the code - not sure where this should go, so maybe just putting it here! (A minimal usage sketch of the overall flow follows this list.)

- LiveSurvey.__init__()
- LiveSurvey.load_acoustic_data
  - Turn the NASC zarr into a dataframe
  - Select the frequency wanted
  - Assign x/y grid and stratum to the NASC dataframe entries
- LiveSurvey.load_biology_data
  - Combine multiple biological csv files
  - Assign the hauls to a specific stratum
  - Insert new data into the database and read back the whole dataset as a dataframe
- LiveSurvey.process_acoustic_data
  - Operations will be skipped if no new data are present in input['acoustics']['prc_nasc_df']
  - compute_nasc calls integrate_nasc to integrate NASC
  - integrate_nasc() does 1) integrate NASC, and 2) calculate echometrics
  - format_acoustic_datset():
    - Adds number_density and biomass_density into nasc_data_df (but not computing them??)
    - Adds successfully processed files into the acoustics database
    - Pulls out the entire combined dataset
- .acoustic_pipeline() and .biology_pipeline() basically do the same thing; it's only the entries updated in the biological estimates database that differ:
  - for .acoustic_pipeline(), only the specific grid(s) associated with the new NASC entries are updated
  - for .biology_pipeline(), all grids in the stratum in which the haul is located are updated
  - .get_average_strata_weights() computes the average strata weights within each of the acoustic or biology pipeline runs
  - Flow of calculation:
    - Compute number_density from nasc and sigma_bs_mean
    - Compute biomass_density from number_density and average_weight
    - summarize_strata() updates the biology_db with the number_density and biomass_density means from the acoustic_db -- not sure why this is needed
    - update_population_grid: updates grid.db
- General flow of ops:
  - New NASC data: __init__() --> load_acoustic_data() --> process_acoustic_data() --> estimate_population()
  - New catch haul data: __init__() --> load_biology_data() --> process_biology_data() --> estimate_population()
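A hypothetical usage sketch of the two flows above (constructor arguments elided; method names as summarized in this thread):

# New NASC data
survey = LiveSurvey(...)
survey.load_acoustic_data()
survey.process_acoustic_data()
survey.estimate_population()

# New catch haul data
survey.load_biology_data()
survey.process_biology_data()
survey.estimate_population()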
Draft PR for the LiveSurvey class rough draft.