LiveSurvey Class #264
base: main
Conversation
# Database root directory
database_root_directory = file_configuration["database_directory"]

# Initialize the database file
initialize_database(database_root_directory, file_settings)

# Drop incomplete datasets
if dataset == "biology":
    data_files = validate_complete_biology_dataset(
        data_files, directory_path, file_configuration
    )

# Query the SQL database to process only new files (or create the db file in the first place)
valid_files, file_configuration["database"][dataset] = query_processed_files(
    database_root_directory, file_settings, data_files
)
This section deviates from the goal of validating paths, as suggested by the function name. It would probably be better to separate these out and keep only the necessary content in load_acoustic_data and load_biological_data. For example, the section below is only needed in load_biological_data:

# Drop incomplete datasets
if dataset == "biology":
    data_files = validate_complete_biology_dataset(
        data_files, directory_path, file_configuration
    )
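A hypothetical sketch of the suggested split (names taken from the diff above; directory_path is assumed to be available in scope): path validation stays in validate_data_directory, while the biology-only completeness check moves into the loader that needs it.

def load_biological_data(self, input_filenames=None):
    # Validate the data directory and format the filepaths
    data_files = eldl.validate_data_directory(
        self.config, dataset="biology", input_filenames=input_filenames
    )
    # Biology-specific step, no longer hidden inside the path validator
    return validate_complete_biology_dataset(
        data_files, directory_path, self.config
    )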
# Validate the data directory and format the filepaths
acoustic_files = eldl.validate_data_directory(
    self.config, dataset="acoustics", input_filenames=input_filenames
)
Note to self: only new files are in the returned acoustic_files, because validate_data_directory right now contains query_processed_files, which excludes files that have already been processed.
# ---- Add the `acoustic_data_units` to the dictionary
self.config["acoustics"]["dataset_units"] = acoustic_data_units
# ---- Preprocess the acoustic dataset
# TODO: SettingWithCopyWarning:
self.input["acoustics"]["prc_nasc_df"] = preprocess_acoustic_data(
    prc_nasc_df.copy(), self.input["spatial"], self.config
)
# ---- Add meta key
self.meta["provenance"].update(
    {
        "acoustic_files_read": acoustic_files,
    }
)
These additions are ad hoc and deviate from the idea behind a validated config model.
return biology_output


def read_acoustic_zarr(file: Path, config_map: dict, xarray_kwargs: dict = {}) -> tuple:
Suggest changing this to nasc_zarr_to_df so that it is clear what it does. That would also avoid the confusing similarity with read_acoustics_files, which has an "s" in "acoustics" while this function name does not...
"required_keys": ["frequency", "units"], | ||
"optional_keys": [], | ||
"keys": { | ||
"frequency": float, | ||
"units": ["Hz", "kHz"], | ||
}, |
I feel this is making the options too liberal...
# ---- Filter out any unused frequency coordinates
prc_nasc_df_filtered = (
    survey_data[survey_data["frequency_nominal"] == transmit_settings["frequency"]]
    # ---- Drop NaN/NaT values from longitude/latitude/ping_time
    .dropna(subset=["longitude", "latitude", "ping_time"])
)
I think it will be more efficient to do this frequency selection in read_acoustic_files, so that what you're doing in this function (preprocess_acoustic_data) is strictly related to spatial operations (and then the function name can also be made more specific). The .drop(columns=["frequency_nominal"]) at the end would then no longer be needed.
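A minimal sketch of the suggested reorganization, with a toy frame standing in for the real survey_data: selecting the frequency up front means the spatial preprocessing never sees frequency_nominal at all.

import pandas as pd

# Toy stand-ins for survey_data and the configured transmit frequency
survey_data = pd.DataFrame(
    {"frequency_nominal": [38000.0, 120000.0], "nasc": [10.0, 12.0]}
)
transmit_settings = {"frequency": 38000.0}
# Filter inside read_acoustic_files and drop the column immediately
filtered = survey_data[
    survey_data["frequency_nominal"] == transmit_settings["frequency"]
].drop(columns=["frequency_nominal"])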
)


def apply_griddify_definitions(dataset: pd.DataFrame, spatial_config: dict):
I think the first half of apply_griddify_definitions (up to # Convert to GeoDataFrame, where you actually handle the acoustic data) could be separated out as a function that you run once in the init, storing the output grids needed to "assign" the NASC data into.
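A hypothetical sketch of that split, assuming shapely/geopandas and placeholder bounds rather than the real spatial_config contents: the grid is built once and cached, so apply_griddify_definitions only has to join NASC rows onto the cells.

import numpy as np
import geopandas as gpd
from shapely.geometry import box

def build_population_grid(xmin, ymin, xmax, ymax, step):
    # Construct square grid cells once (e.g. in the init) and reuse them
    cells = [
        box(x, y, x + step, y + step)
        for x in np.arange(xmin, xmax, step)
        for y in np.arange(ymin, ymax, step)
    ]
    return gpd.GeoDataFrame({"cell_id": range(len(cells))}, geometry=cells)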
# # ---- Create filepath object
if "data_root_dir" in file_configuration:
    # directory_path = Path(file_configuration["data_root_dir"]) / file_settings["directory"]
    directory_path = "/".join([file_configuration["data_root_dir"], file_settings["directory"]])
else:
    directory_path = file_settings["directory"]
I am curious what the problem with using a Path object is. Path("") / file_settings["directory"] would still give Path(file_settings["directory"]), so it seems the if-else can be eliminated.
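A minimal sketch of that simplification (the dict contents here are placeholders): pathlib normalizes away empty components, so the two branches collapse into one expression.

from pathlib import Path

file_configuration = {"data_root_dir": ""}  # may or may not contain the key
file_settings = {"directory": "biology"}
# A missing or empty root directory falls through harmlessly
directory_path = Path(file_configuration.get("data_root_dir", "")) / file_settings["directory"]
assert directory_path == Path("biology")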
    if isinstance(df, pd.DataFrame) and not df.empty
}
# ---- Create new data flag
file_configuration["length_distribution"] = prepare_length_distribution(file_configuration)
This addition of "length_distribution" to self.config is probably not a good idea because you already have self.config["biology"]["length_distribution"]. Since the config dict is defined upfront, I would suggest against adding undefined keys to it in other parts of the code. I think I made another comment on this as well.
# ---- Incorporate additional data, if new data are present
if filtered_biology_output:
    # ---- Merge the trawl information and app
    merge_trawl_info(filtered_biology_output)
Why do you delete biology_dict["trawl_info_df"] at the end of this function?
return length_bins_df


def preprocess_biology_data(biology_output: dict, spatial_dict: dict, file_configuration: dict):
I need your help to go through the intention of this function, because it is quite hidden behind the many layers of very specific functions. I think there is a balance between reusability and readability that we should discuss. I also have an additional question on why we need a database if, for new trawls, we are basically just appending (accumulating) new data, especially since at the end you insert the new data and then pull out the combined data. Wouldn't that be the same as reading a csv into a dataframe, adding the new data to it, and then saving this new dataframe containing everything under the same csv filename?
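A minimal sketch of the csv-accumulation alternative described above (the filename and new_data frame are hypothetical stand-ins for the incoming trawl data):

import pandas as pd
from pathlib import Path

csv_path = Path("biology_accumulated.csv")
new_data = pd.DataFrame({"haul": [101], "weight_kg": [3.2]})
# Read what exists, append the new rows, and write everything back
frames = [pd.read_csv(csv_path)] if csv_path.exists() else []
combined = pd.concat(frames + [new_data], ignore_index=True)
combined.to_csv(csv_path, index=False)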
return df_validated


def infer_datetime_format(timestamp_str: Union[int, str]):
I wonder if the patterns you included here are already covered by pandas' guess_datetime_format, since these seem to be pretty standard ones. Also, do you need these because the files from FEAT have different datetime formats? For now this obviously works, but I feel we can communicate with Alicia to see if she could make those uniform.
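For reference, a quick check of that pandas helper (public under pandas.tseries.api as of pandas 2.0); whether it covers every FEAT format is an open question, but standard timestamps resolve cleanly:

from pandas.tseries.api import guess_datetime_format

fmt = guess_datetime_format("2024-07-15 12:30:00")
print(fmt)  # expected: "%Y-%m-%d %H:%M:%S"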
def filter_filenames(
    directory_path: Path, filename_id: str, files: List[Path], file_extension: str
):
I think we can use fsspec for flexible handling between different file systems. Also, it is a lot of work here to get something general across all the files, yet the patterns are still pretty specific. For the scenario we are dealing with, I'd say generality is not the first priority, and more hard-coding may actually make the code more readable. Let's discuss this when we meet.
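A minimal fsspec sketch (the path is hypothetical): the same glob call works whether the underlying filesystem is local, s3, etc., so filter_filenames would not need per-filesystem branches.

import fsspec

fs = fsspec.filesystem("file")  # swap "file" for "s3", "gcs", ...
csv_files = fs.glob("data/biology/*.csv")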
# Add population-specific columns (specified in the file configuration)
# TODO: Add to `yaml` file for configuration; hard-code for now
add_columns = ["number_density", "biomass_density"]
# ----
df[add_columns] = 0.0
Just trying to make sure that I understand - these two columns are not supposed to be added here, and they are here as a temporary hardcoded fix?
strata_values = np.unique(nasc_biology_data["stratum"]).tolist()

# Update the table
sql_update_strata_summary(
This is called in both biology_pipeline and acoustic_pipeline. I am not sure why this update from acoustic_db to biology_db needs to happen. Is it for the sake of keeping a summary in the biology_db?
Interim summary from me reading the code - not sure where this should go, so maybe just putting it here! (A minimal usage sketch of the overall flow follows this list.)

- LiveSurvey.__init__()
- LiveSurvey.load_acoustic_data
  - Turn the NASC zarr into a dataframe
  - Select the frequency wanted
  - Assign x/y grid and stratum to the NASC dataframe entries
- LiveSurvey.load_biology_data
  - Combine multiple biological csv files
  - Assign the hauls to a specific stratum
  - Insert new data into the database and read back the whole dataset as a dataframe
- LiveSurvey.process_acoustic_data
  - Operations will be skipped if no new data are present in input['acoustics']['prc_nasc_df']
  - compute_nasc calls integrate_nasc to integrate NASC
  - integrate_nasc() does 1) integrate NASC, and 2) calculate echometrics
  - format_acoustic_datset():
    - Adds number_density and biomass_density into nasc_data_df (but not computing them??)
    - Adds successfully processed files into the acoustics database
    - Pulls out the entire combined dataset
- .acoustic_pipeline() and .biology_pipeline() basically do the same thing; it's only the entries updated in the biological estimates database that differ:
  - for .acoustic_pipeline(), only the specific grid(s) associated with the new NASC entries are updated
  - for .biology_pipeline(), all grids in the stratum in which the haul is located are updated
  - .get_average_strata_weights() computes the average strata weights within each of the acoustic or biology pipeline runs
  - Flow of calculation:
    - Compute number_density from nasc and sigma_bs_mean
    - Compute biomass_density from number_density and average_weight
    - summarize_strata() updates the biology_db with the number_density and biomass_density means from the acoustic_db -- not sure why this is needed
    - update_population_grid: updates grid.db
- General flow of ops:
  - New NASC data: __init__() --> load_acoustic_data() --> process_acoustic_data() --> estimate_population()
  - New catch haul data: __init__() --> load_biology_data() --> process_biology_data() --> estimate_population()
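A hypothetical usage sketch of the two flows above (constructor arguments elided; method names as summarized in this thread):

# New NASC data
survey = LiveSurvey(...)
survey.load_acoustic_data()
survey.process_acoustic_data()
survey.estimate_population()

# New catch haul data
survey.load_biology_data()
survey.process_biology_data()
survey.estimate_population()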
Draft PR for the LiveSurvey class rough draft.