
LiveSurvey Class #264

Draft · wants to merge 83 commits into base: main

Conversation

@brandynlucca brandynlucca (Collaborator) commented Aug 28, 2024

Draft PR for the LiveSurvey class rough draft.

Comment on lines +307 to +322
# Database root directory
database_root_directory = file_configuration["database_directory"]

# Initialize the database file
initialize_database(database_root_directory, file_settings)

# Drop incomplete datasets
if dataset == "biology":
    data_files = validate_complete_biology_dataset(
        data_files, directory_path, file_configuration
    )

# Query the SQL database to process only new files (or create the db file in the first place)
valid_files, file_configuration["database"][dataset] = query_processed_files(
    database_root_directory, file_settings, data_files
)
@leewujung leewujung (Member) commented Aug 28, 2024

This section deviates from the goal of validating paths, which is what the function name suggests.
It would probably be better to separate these steps out and keep only the necessary content in load_acoustic_data and load_biological_data. For example, the section below is only needed in load_biological_data:

# Drop incomplete datasets
    if dataset == "biology":
        data_files = validate_complete_biology_dataset(
            data_files, directory_path, file_configuration
        )
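
For illustration, a minimal sketch of that split (names and signatures assumed from this PR, not verified): validate_data_directory would stay purely about paths, and the completeness check would live in the biology loader.

def load_biology_data(self, input_filenames=None):
    # Path validation only -- no dataset-specific filtering here
    biology_files = eldl.validate_data_directory(
        self.config, dataset="biology", input_filenames=input_filenames
    )
    # Drop incomplete datasets here, where it is actually needed
    biology_files = validate_complete_biology_dataset(
        biology_files, self.config["data_root_dir"], self.config
    )
    ...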

Comment on lines +124 to +127
# Validate the data directory and format the filepaths
acoustic_files = eldl.validate_data_directory(
    self.config, dataset="acoustics", input_filenames=input_filenames
)

Note to self: only new files are in the returned acoustic_files, because validate_data_directory right now contains query_processed_files that would exclude files that have already been processed.

Comment on lines +136 to +148
# ---- Add the `acoustic_data_units` to the dictionary
self.config["acoustics"]["dataset_units"] = acoustic_data_units
# ---- Preprocess the acoustic dataset
# TODO: SettingWithCopyWarning:
self.input["acoustics"]["prc_nasc_df"] = preprocess_acoustic_data(
    prc_nasc_df.copy(), self.input["spatial"], self.config
)
# ---- Add meta key
self.meta["provenance"].update(
    {
        "acoustic_files_read": acoustic_files,
    }
)

These additions are ad hoc and deviate from the idea behind a validated config model.
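
As an aside, a minimal sketch of what keeping this inside a validated config model could look like (hypothetical, using pydantic; the class and field names here are assumptions, not the model used in this PR):

from typing import Optional
from pydantic import BaseModel

class AcousticsConfig(BaseModel):
    # Declared upfront so loaders can populate it without injecting undeclared keys
    dataset_units: Optional[dict] = None

class LiveConfig(BaseModel):
    acoustics: AcousticsConfig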

return biology_output


def read_acoustic_zarr(file: Path, config_map: dict, xarray_kwargs: dict = {}) -> tuple:

Suggest changing this to nasc_zarr_to_df so that it is clear what it does. This also avoids confusion with read_acoustics_files, which has an "s" in "acoustics" while this function name does not...

Comment on lines +43 to +48
"required_keys": ["frequency", "units"],
"optional_keys": [],
"keys": {
"frequency": float,
"units": ["Hz", "kHz"],
},

I feel this is making the options too liberal...

Comment on lines +48 to +53
# ---- Filter out any unused frequency coordinates
prc_nasc_df_filtered = (
    survey_data[survey_data["frequency_nominal"] == transmit_settings["frequency"]]
    # ---- Drop NaN/NaT values from longitude/latitude/ping_time
    .dropna(subset=["longitude", "latitude", "ping_time"])
)
@leewujung leewujung (Member) commented Aug 28, 2024

I think it will be more efficient to do this frequency selection in read_acoustic_files, so that what you're doing in this function (preprocess_acoustic_data) is strictly related to spatial operations (and the function name can then be made more specific as well). That way no .drop(columns=["frequency_nominal"]) is needed at the end either.
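
Roughly, the fragment that would move into read_acoustic_files could look like this (a sketch; the surrounding signature and variable names are assumed from this diff):

# Inside read_acoustic_files, so preprocess_acoustic_data only handles spatial operations
prc_nasc_df = (
    survey_data.loc[survey_data["frequency_nominal"] == transmit_settings["frequency"]]
    .dropna(subset=["longitude", "latitude", "ping_time"])
    .drop(columns=["frequency_nominal"])
)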

)


def apply_griddify_definitions(dataset: pd.DataFrame, spatial_config: dict):

I think the first half of apply_griddify_definitions (up to # Convert to GeoDataFrame, where you actually handle the acoustic data) could be separated out into a function that you run once in the init, storing the output grids that the NASC data are then "assigned" into.
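
A rough sketch of what that could look like (hypothetical; the spatial_config keys and grid construction below are assumptions, just to show the shape of the refactor):

import geopandas as gpd
import numpy as np
from shapely.geometry import box

def build_spatial_grid(spatial_config: dict) -> gpd.GeoDataFrame:
    # Build the x/y grid cells once; run from __init__ and store for later assignment
    dx, dy = spatial_config["grid_size_x"], spatial_config["grid_size_y"]
    x_edges = np.arange(spatial_config["xmin"], spatial_config["xmax"], dx)
    y_edges = np.arange(spatial_config["ymin"], spatial_config["ymax"], dy)
    cells = [box(x, y, x + dx, y + dy) for x in x_edges for y in y_edges]
    return gpd.GeoDataFrame({"geometry": cells}, crs="EPSG:4326")

apply_griddify_definitions would then only do the point-to-cell assignment (e.g. a spatial join) against the stored grid, instead of rebuilding it on every call.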

Comment on lines +147 to +152
# # ---- Create filepath object
if "data_root_dir" in file_configuration:
    # directory_path = Path(file_configuration["data_root_dir"]) / file_settings["directory"]
    directory_path = "/".join([file_configuration["data_root_dir"], file_settings["directory"]])
else:
    directory_path = file_settings["directory"]

I am curious what the problem with using a Path object is. Since Path("") / file_settings["directory"] still gives Path(file_settings["directory"]), it seems the if-else can be eliminated.
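
For example (a minimal sketch using the keys from the snippet above):

from pathlib import Path

# Path("") / "subdir" evaluates to Path("subdir"), so a missing root collapses cleanly
directory_path = Path(file_configuration.get("data_root_dir", "")) / file_settings["directory"]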

if isinstance(df, pd.DataFrame) and not df.empty
}
# ---- Create new data flag
file_configuration["length_distribution"] = prepare_length_distribution(file_configuration)

This addition of "length_distribution" in self.config is probably not a good idea because you already have self.config["biology"]["length_distribution"]. Since the config dict is defined upfront, I would suggest against adding undefined keys into it in other parts of the code. I think I made another comment also on this.

# ---- Incorporate additional data, if new data are present
if filtered_biology_output:
    # ---- Merge the trawl information and app
    merge_trawl_info(filtered_biology_output)

Why do you delete biology_dict["trawl_info_df"] at the end of this function?

return length_bins_df


def preprocess_biology_data(biology_output: dict, spatial_dict: dict, file_configuration: dict):
@leewujung leewujung (Member) commented Aug 29, 2024

I need your help to go through the intention of this function, because it is quite hidden behind the many layers of very specific functions. I think there's a balance between reusability and readability that we should discuss. I also have an additional question on why we need a database if, for new trawls, we are basically just appending (accumulating) new data -- especially since at the end you insert the new data and then pull out the combined data. Wouldn't that be the same as reading a CSV into a dataframe and adding the new data to it? This new dataframe containing everything could then be saved under the same CSV filename.
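
To illustrate the CSV alternative being suggested (a hypothetical sketch; the file path and function name are placeholders):

import pandas as pd
from pathlib import Path

def accumulate_csv(new_df: pd.DataFrame, csv_path: Path) -> pd.DataFrame:
    # Append the new trawl data to the accumulated CSV and return the combined dataframe
    combined = (
        pd.concat([pd.read_csv(csv_path), new_df], ignore_index=True)
        if csv_path.exists()
        else new_df
    )
    combined.to_csv(csv_path, index=False)
    return combined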

return df_validated


def infer_datetime_format(timestamp_str: Union[int, str]):

I wonder if the patterns you included here are already covered by pandas' guess_datetime_format, since these seem to be pretty standard ones.

Also, do you need these because the files from FEAT have different datetime formats? For now this obviously works, but I feel we could communicate with Alicia to see if she could make those uniform.
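
For reference, a small sketch of leaning on pandas instead (guess_datetime_format is exposed under pandas.tseries.api in pandas >= 2.0, if I remember correctly):

import pandas as pd
from pandas.tseries.api import guess_datetime_format

fmt = guess_datetime_format("2024-08-28 13:45:00")  # e.g. "%Y-%m-%d %H:%M:%S"
timestamps = pd.to_datetime(["2024-08-28 13:45:00"], format=fmt)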

Comment on lines +98 to +100
def filter_filenames(
    directory_path: Path, filename_id: str, files: List[Path], file_extension: str
):

I think we can use fsspec for flexible handling across different file systems. Also, it is a lot of work here to get something that is general across all the files, while the patterns are still pretty specific. For the scenario we are dealing with, I'd say generality is not the first priority, and more hard-coding may actually make the code more readable. Let's discuss this when we meet.
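
A small sketch of the fsspec idea (the protocol and glob pattern here are only illustrative):

import fsspec

# The same interface works across local and remote file systems
fs = fsspec.filesystem("local")  # or "s3", "gcs", ...
files = fs.glob("/data/live/*_catch_*.csv")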

Comment on lines +302 to +306
# Add population-specific columns (specified in the file configuration)
# TODO: Add to `yaml` file for configuration; hard-code for now
add_columns = ["number_density", "biomass_density"]
# ----
df[add_columns] = 0.0

Just trying to make sure that I understand - these two columns are not supposed to be added here, and they are here as a temporary hardcoded fix?

strata_values = np.unique(nasc_biology_data["stratum"]).tolist()

# Update the table
sql_update_strata_summary(
@leewujung leewujung (Member) commented Aug 29, 2024

This is called in both biology_pipeline and acoustic_pipeline. I am not sure why this update from acoustic_db to biology_db needs to happen? Is it for the sake of keeping a summary in the biology_db?

@leewujung leewujung (Member) left a comment

Interim summary from my read-through of the code - not sure where this should go, so I'm just putting it here!

LiveSurvey.init()

LiveSurvey.load_acoustic_data

  • Turn NASC zarr to dataframe
  • Select the frequency wanted
  • Assign x/y grid and stratum to the NASC dataframe entries

LiveSurvey.load_biology_data

  • Combine multiple biological csv files
  • Assign the hauls to specific stratum
  • Insert new data into database and read back the whole dataset as dataframe

LiveSurvey.process_acoustic_data

  • Operations will be skipped if no new data is present in input['acoustics']['prc_nasc_df']
  • .compute_nasc calls .integrate_nasc to integrate NASC
  • .integrate_nasc() does 1) integrate NASC, and 2) calculates echometrics
  • .format_acoustic_datset()
    • Adds number_density and biomass_density into nasc_data_df (but not computing them??)
    • Adds successfully processed files into the acoustics database
    • Pulls out the entire combined dataset
  • .acoustic_pipeline() and .biology_pipeline() basically do the same thing; only the entries updated in the biological estimates database differ:
    • for .acoustic_pipeline(), only the specific grid(s) that is associated with the new NASC entries are updated
    • for .biology_pipeline(), all grids in the stratum for which the haul is located are updated
    • .get_average_strata_weights() computes the average strata weights within each of the acoustic or biology pipeline runs
    • Flow of calculation:
      • Compute number_density from nasc and sigma_bs_mean
      • Compute biomass_density from number_density and average_weight
      • summarize_strata() updates the biology_db with number_density and biomass_density mean from the acoustic_db -- not sure why this is needed
      • update_population_grid: updates grid.db
  • General flow of ops (see the usage sketch after this list):
    • New NASC data: init() --> load_acoustic_data() --> process_acoustic_data() --> estimate_population()
    • New catch haul data: init() --> load_biology_data() --> process_biology_data() --> estimate_population()
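
Putting the above together, a usage sketch of the general flow (constructor arguments omitted; based only on the method names summarized above):

# New NASC data
survey = LiveSurvey(...)
survey.load_acoustic_data()
survey.process_acoustic_data()
survey.estimate_population()

# New catch/haul data
survey = LiveSurvey(...)
survey.load_biology_data()
survey.process_biology_data()
survey.estimate_population()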
