Incomplete CMIP6 data on Google Cloud #2
Yes, we can do that. It might take some time, but this is within my job description hehe. Could you compile a list of instance_ids of the models that are still needed (this package might be helpful)? That would help tremendously with this. |
I created a script here: https://github.com/Timh37/CMIP6cf/blob/main/list_missing_cmip6_files_cloud.py. If you want me to condense this to the most needed experiments & models I can do that. |
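The core of such a listing script can be sketched as a set difference between facet tables from ESGF and the cloud catalog. This is a hypothetical sketch, not the linked script's actual code; the facet columns, the toy rows, and the `to_iid` helper are illustrative assumptions:

```python
import pandas as pd

# Hypothetical facet tables: one row per dataset on ESGF / on the cloud.
FACETS = ["activity_id", "source_id", "experiment_id", "member_id", "table_id", "variable_id"]

esgf = pd.DataFrame(
    [
        ["ScenarioMIP", "MIROC6", "ssp585", "r1i1p1f1", "day", "sfcWind"],
        ["ScenarioMIP", "MIROC6", "ssp585", "r1i1p1f1", "day", "psl"],
    ],
    columns=FACETS,
)
cloud = pd.DataFrame(
    [["ScenarioMIP", "MIROC6", "ssp585", "r1i1p1f1", "day", "sfcWind"]],
    columns=FACETS,
)

def to_iid(df: pd.DataFrame) -> set:
    """Join the facet columns into dot-separated instance_id strings."""
    return set(df[FACETS].agg(".".join, axis=1))

# Datasets present on ESGF but absent from the cloud catalog.
missing = sorted(to_iid(esgf) - to_iid(cloud))
print(missing)
```

In practice the two tables would come from an ESGF search client and from the cloud catalog's CSV, but the comparison itself reduces to this set difference.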
Excellent. Could you post the output (e.g. list of instance_ids) here? If you have a preferred subset, that would also be helpful. |
For the following instance_ids, runs are not missing but incomplete (i.e., files are available for some years but not for all years):
|
For now, I have limited the instance_id list of missing files to the
NB: I have received |
Thanks @Timh37 this is great. Getting these up on the cloud will still take a bit, but I can try to prioritize some of these. Lets chat about this later today. |
Ok, so here's a condensed list of 19 instances that'd be really great to have. With these we would have all available realizations of
|
Updated list of instance_ids of CMIP6 models providing daily mean sfcWind, psl and pr that are not yet available on Google Cloud (@jbusecke):
ACCESS-CM2
ACCESS-ESM1-5
EC-Earth3 (all because most have missing timesteps)
EC-Earth3-Veg
FGOALS-g3
HadGEM3-GC31-LL
HadGEM3-GC31-MM
IPSL-CM6A-LR
KIOST-ESM
MIROC6
MPI-ESM1-2-LR
MRI-ESM2-0
UKESM1-0-LL
|
Starting a first test of the UKESM1-0-LL data here. |
Hey @Timh37, I just wanted to loop you in here (I realize that leap-stc/cmip6-leap-feedstock#9 is quite a lot to follow haha). I will provide you with dataframes of the form | instance_id | gs_store_path |. The reason why there will be two of those is the following: a lot of datasets did not pass my rudimentary testing (many of them have, for instance, gaps in time). So I will provide you with one dataframe of 'quality controlled instances' and another one of 'the data was written, but might be funky' instances. I am not sure if you want to spend the time seeing whether you can rescue some of that data (this would involve digging through the errata and probably working on each individual dataset, a task that we cannot fulfill on our end). On a general note: finding many datasets with issues here is maybe not surprising, as I think this might be the reason they were not processed by our older logic in the first place. |
@jbusecke I immensely appreciate this amazing effort on your and @cisaacstern's end, thank you! It's a pity that many iids are funky. For my understanding - are these datasets troublesome to process due to issues in the workflow that may need further attention, and/or would you expect part of this data to be incomplete on ESGF as well? I guess how helpful the additional iids will be depends on which models they will allow us to populate (especially with regards to the few large-ensemble models), which we will probably find out soon then. If large sets of data will still be missing on the cloud but not on ESGF and prevent us from doing certain analyses, it may be helpful if we could discuss how best to explain this in the paper. Thanks! |
AFAICT these issues are outside of our workflow and would also exist when using netcdfs, basically these datasets are broken! I think we can find a way to point to these efforts here and say we only use the datasets that pass the given standards. |
Copied the above list of iids to track which data, if any, is still missing after adding the new datasets from @jbusecke. (Most of) these seem to be complete and downloadable from ESGF.
ACCESS-CM2
ACCESS-ESM1-5
EC-Earth3
EC-Earth3-Veg
FGOALS-g3
HadGEM3-GC31-LL
HadGEM3-GC31-MM
IPSL-CM6A-LR
KIOST-ESM
MIROC6
MPI-ESM1-2-LR
MRI-ESM2-0
UKESM1-0-LL
|
@jbusecke Thanks a lot for providing the new CMIP6 data catalogue! I am actively working on ingesting these. I started by checking the simulations in the new catalogue against the list of missing iids, see my post above: #2 (comment). As expected, quite a few missing iids are now added, which is really great, but a large part is still missing as well. Strangely, many of these seem to be complete in terms of timesteps, and I also seem to be able to successfully download many of them from ESGF. There is a pattern in that, for the majority of iids that are still missing, complete datasets are available for download at some ESGF nodes, but the same datasets are incomplete, unavailable or retracted at other nodes. Could this be problematic for the cmip6-leap-feedstock and cause these iids to be omitted? While the iids that have been successfully added are really very helpful and much appreciated, they only allow me to populate one additional large-ensemble model (ACCESS-ESM1-5). Having complete datasets for the other large-ensemble models MPI-ESM1-2-LR, EC-Earth3 and MIROC6 would be really helpful for the study. |
@jbusecke I was able to load the CMIP6 data from the existing and new data catalogues (using https://github.com/Timh37/CMIP6cex/blob/main/cmip6_processing/ingest_CMIP6_data_from_two_cats.ipynb), although in a hacky way, as I needed to apply the 'require_all_on' function of intake-esm after combining the two catalogues. By combining the old and new catalogues, however, duplicate datasets are introduced that differ only in their version. Do you have any suggestions for how to handle this? Versions tend to differ between experiments and sometimes also between variables, and I am not sure what logic applies best. |
Awesome progress @Timh37! Glad this is helping. FYI, every time I run these I see a few more 'getting through', so I strongly suspect with a bit of time, the numbers will improve. This really hints at availability issues on the data nodes...
Whenever there is a newer version (with all other facets the same), you should use the newer, and discard the older one. This is what is done in each catalog internally too btw.
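That newest-wins rule can be sketched with plain pandas. This is a hypothetical sketch; the facet columns and toy version strings are illustrative, not the catalogue's actual schema:

```python
import pandas as pd

FACETS = ["source_id", "member_id", "experiment_id", "variable_id", "grid_label"]

# Toy stand-ins for the old and new catalogue dataframes.
old = pd.DataFrame(
    [["MIROC6", "r1i1p1f1", "ssp585", "psl", "gn", "v20190627"]],
    columns=FACETS + ["version"],
)
new = pd.DataFrame(
    [["MIROC6", "r1i1p1f1", "ssp585", "psl", "gn", "v20210101"]],
    columns=FACETS + ["version"],
)

combined = (
    pd.concat([old, new], ignore_index=True)
    .sort_values("version")                        # 'vYYYYMMDD' strings sort chronologically
    .drop_duplicates(subset=FACETS, keep="last")   # keep only the newest per facet combination
    .reset_index(drop=True)
)
print(combined["version"].tolist())
```

Because CMIP6 versions are `vYYYYMMDD` strings, a plain lexicographic sort is also a chronological sort, so `keep="last"` after sorting retains the newest entry for every unique facet combination.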
You could manually append the two dataframes and create a new esm_collection that should do that for you. See here. If that works, that would be the best way to handle things consistently. You can always check the
Since I query the live ESGF API each time the recipes are resubmitted I would hope these things would figure themselves out eventually. If this is not solved in a few days, lets talk quickly and maybe you could try to help out debugging pangeo-forge-esgf with some concrete usecases?
I will try to investigate these iids in the next days... in the meantime you could check in yet another catalog (I know this has to be cleaned up) if these datasets might have not passed the quality control. You can access it like this. For now this catalog includes both the qc and the non-qc files, so you will have to do some filtering to identify the cases that did not pass the qc. In the future this should be done on our side (tracking this here) |
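The filtering step could look like this. This is a sketch; the boolean `passed_qc` column name is a guess for illustration, not the catalogue's actual schema:

```python
import pandas as pd

# Hypothetical catalogue dataframe mixing QC'd and non-QC'd entries.
cat_df = pd.DataFrame(
    {
        "instance_id": ["a.b.c", "d.e.f", "g.h.i"],
        "passed_qc": [True, False, True],
    }
)

qc_ok = cat_df[cat_df["passed_qc"]]          # safe to use directly
needs_review = cat_df[~cat_df["passed_qc"]]  # 'written, but might be funky'
print(needs_review["instance_id"].tolist())
```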
BTW this would explain why none of the MPI ones have made it so far! |
Also hopeful that this will give us some more successful writes! |
That sounds sensible, although would it be problematic if different variables of the same simulation would end up having different versions? Not sure if this happens, it might be ok if this happens if those variables weren't affected by the issue that necessitated a newer version to be uploaded.
Yes, I'm using

```python
import itertools
import typing

import pandas as pd


def search_apply_require_all_on(
    *,
    df: pd.DataFrame,
    query: dict[str, typing.Any],
    require_all_on: typing.Union[str, list[typing.Any]],
    columns_with_iterables: set = None,
) -> pd.DataFrame:
    _query = query.copy()
    for column in require_all_on:
        _query.pop(column, None)
    keys = list(_query.keys())
    grouped = df.groupby(require_all_on)
    values = [tuple(v) for v in _query.values()]
    condition = set(itertools.product(*values))
    query_results = []
    for _, group in grouped:
        group_for_index = group
        # Unpack iterables to get a testable index.
        # (unpack_iterable_column is an intake-esm helper, imported elsewhere.)
        for column in (columns_with_iterables or set()).intersection(keys):
            group_for_index = unpack_iterable_column(group_for_index, column)
        index = group_for_index.set_index(keys).index
        if not isinstance(index, pd.MultiIndex):
            index = {(element,) for element in index.to_list()}
        else:
            index = set(index.to_list())
        if condition.issubset(index):  # with iterables we could have more than requested
            query_results.append(group)
    if query_results:
        return pd.concat(query_results).reset_index(drop=True)
    return pd.DataFrame(columns=df.columns)


search_apply_require_all_on(
    df=combined_cat.df,
    query=my_query,
    require_all_on=['source_id', 'member_id', 'grid_label'],
)
```
Can definitely do that. I checked just now; I think that catalogue only includes the following, though: |
That seems right to me.
Why not? Swapping the dataframe should not disable any of the search capabilities AFAIK. Do you have an example of that failing? |
@jbusecke At first glance not seeing many additional complete simulations. I will investigate more closely soon. |
Updated list of missing iids: ACCESS-CM2
ACCESS-ESM1-5
EC-Earth3
FGOALS-g3
KIOST-ESM
MIROC6
MPI-ESM1-2-LR
MRI-ESM2-0
UKESM1-0-LL
|
I did mix in a bunch of other requests, which might have made up the brunt of the new datasets. The pipeline is currently broken due to some dependency problems, but I am hopeful to revive it soon. Thanks for sticking with us during these growing pains. |
If you have certain suspicions about a particular iid actually being available, it would be helpful if you could install pangeo-forge-esgf in a python environment and run the following:

```python
import logging

import pangeo_forge_esgf
from pangeo_forge_esgf import get_urls_from_esgf, setup_logging

iid = 'the.suspicious.iid'
iids = [iid]
setup_logging('DEBUG')
url_dict = await get_urls_from_esgf(iids)
urls = url_dict[iid]
urls
```

If there are discrepancies, this might help me to debug pangeo-forge-esgf (which is currently reporting many of the iids you want as unavailable). |
Reporting my conclusions after running that, below:
|
Many thanks for digging into this @Timh37. As discussed on Slack, I redid the way we select a url when multiple are found per file, which seems to give us a LOT more datasets. So I think we can put the debugging on hold for now and see what we get done over the weekend. Some quick comments:
This is a data node, but the only thing we are specifying is search nodes. Ideally we would only ever need one search node, because they can perform a distributed search, but I have found that results sometimes vary, so I am being extra cautious here.
Not necessarily, it depends on whether this produces gaps in the final dataset (that would cause the datasets to fail). |
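Whether a dataset would produce gaps in the final concatenated dataset can be checked by comparing its time axis against the expected calendar. This is a sketch assuming daily data on a standard calendar; real CMIP6 output often uses 360-day or noleap calendars, which would need cftime-aware handling instead:

```python
import pandas as pd

def has_time_gaps(times: pd.DatetimeIndex, freq: str = "D") -> bool:
    """True if the time axis is missing any step between its first and last entry."""
    expected = pd.date_range(times.min(), times.max(), freq=freq)
    return not expected.equals(pd.DatetimeIndex(sorted(times)))

full = pd.date_range("2015-01-01", "2015-12-31", freq="D")
gappy = full.delete(100)  # drop one day to simulate a missing timestep

print(has_time_gaps(full))   # False
print(has_time_gaps(gappy))  # True
```

A check like this is essentially what distinguishes the "incomplete" iids listed above (some years present, others missing) from datasets that merely start or end at unexpected dates.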
@jbusecke with the new data added I have discovered a few more incomplete datasets that weren't flagged before. Could these please be added to the list of iid's? Thanks.
also noting that a few historical EC-Earth3 files that have been added to the leap-persistent test catalogue are not complete despite complete availability at all available nodes:
|
@jbusecke please note that the issue with ACCESS-ESM1-5 still persists, causing some datasets to be dropped because they start later than 2015. The same thing happens for many many datasets of EC-Earth3. For example, see |
@jbusecke do you think this is something that could be solved in the coming months or should I not expect many more datasets? |
I am not sure at this point. I can certainly try to investigate this a bit deeper, but that would have to wait until mid-late Nov. Do you mind pinging me about this again then? |
I am finding quite some models for which part of the variables or variants I would like to include are not available on Google Cloud but are on ESGF. Can these somehow be added?