
Inquiry About "Aurora" Model Details #47

Open
qqydss opened this issue Oct 24, 2024 · 3 comments

qqydss commented Oct 24, 2024

Introduction:
I have been following the work on Microsoft's weather model "Aurora" and have carefully read the paper and code. I am writing to seek clarification on some details of the experimental setup and model architecture. I would greatly appreciate your insights on the following questions:

  1. When using dataset configuration C4 for pretraining, if the inputs come from different data sources, must their corresponding predicted future ground truths all come from the ERA5 dataset? In other words, could there be inputs with the same time label but slightly different values that correspond to the same ground truth? If so, could this be considered a form of data augmentation, similar to distorting images in CV classification?

  2. In the "Comparison with AI models at 0.25° resolution" section, Figure 4 shows token_num on the x-axis. Could you please explain how this number is calculated?

  3. For the dataset labeled C3, whose ensemble-mode data has only 3 pressure levels, when a batch retrieves ensemble-mode data, does the corresponding predicted future ground truth also have only 3 levels? If so, does it use the same weights for the latent-level queries and the atmospheric keys & values (shown in Figure 6 of the article) as when the input data has 13 pressure levels?

  4. In Figure 4b, is the input for Aurora the "HRES Analysis" from HRES-T0 in 2022, and is the ground truth ERA5?

  5. In the fine-tuning settings of Aurora-0.1°, is the ground truth ERA5?

  6. In Figure 3b, is the input for Aurora the "HRES Analysis" from HRES-T0? As I understand it, HRES starts every 12 hours, so there are only two zero-lead-time fields per day (00/12). Is the evaluation in Figure 3b conducted every 12 hours?

  7. In supplement B.7, formula (9), is x the raw data or normalized data? Additionally, I plotted x_transformed against x and found that the relationship is not a monotonic bijection, so multiple values of x can map to the same x_transformed, causing information loss. Has the impact of this on model performance been considered?

[Image: plot of x_transformed versus x]

  8. Could you please elaborate on the process of "embedding dependent on the pressure level" in supplement B.7? For example, how does the tensor shape change? Is this operation only for pollution variables, or also for U, V, T, Q, Z? Are the embeddings for U, V, T, Q, Z initialized with the weights of the 12-hour pretrained model, while the pollution variables are initialized from scratch?

  9. In D.3 (CAMS 0.4° Analysis), how are the learning rates for the backbone and the Perceiver decoder set?

  10. In B.7, "Additional static variables" introduces two constant masks for the timestamp. However, in the code both the encoder and swin3d_backbone (AdaptiveLayerNorm) already use a Fourier encoding of the timestamp. Why reintroduce a timestamp mask in the input for pollution forecasting?

  11. In model/film.py, AdaptiveLayerNorm initializes self.ln_modulation's weights and biases to 0, meaning that shift and scale are 0 at the start of training, which makes the backbone almost equivalent to an identity mapping at the beginning. What is the rationale or empirical support behind this initialization method?

  12. In the pollution forecasting experiments, what benefits come from concatenating the static variables (z, slt, lsm) with the atmospheric variables rather than with the surface variables? Is it a performance improvement or computational efficiency?

  13. In the fine-tuning of Aurora-0.1°, when the patch size is increased from 4 to 10, is my understanding correct that 10×10 patches are interpolated into 4×4 patches before entering the embedding module, and that during the Perceiver decoder stage these 4×4 patches are interpolated back to 10×10 before being unpatchified into the forecast field? If my understanding is incorrect, could you describe the correct procedure?

  14. In Table 4, the HRES-0.1 and HRES-0.25 datasets cover almost the same time span and contain exactly the same variables. Why does HRES-0.1 have far fewer frames ("Num frames") than HRES-0.25?

Thank you very much for your time and consideration. I am eager to learn from your insights!


wesselb commented Oct 29, 2024

Hey @qqydss! Thank you for your very thorough questions. Just a quick message to let you know that we've seen this. :) We will get back to you shortly!


qqydss commented Oct 30, 2024

Great to hear that you've received my questions and will get back to me soon. Looking forward to your response. Thanks!


wesselb commented Dec 3, 2024

Hey @qqydss! Apologies for the delay in getting back to you. We made a big push to get a new version of the paper out on arXiv, which we're pretty thrilled about. Let me answer your questions in order.

  1. In dataset configuration C4, for the same timestamp, the model indeed sees different inputs and targets from different sources. For example, for one batch, the model takes in ERA5 and predicts ERA5; and for another batch the model takes in HRES forecasts and predicts HRES forecasts.
  2. The encoder converts the batch to a token-based representation in the following way: the model first performs a patch encoding of size 4x4 and then aggregates the real pressure levels into 3 "latent" pressure levels. For example, for a batch at 0.25° resolution, this results in 720/4 × 1440/4 × (3 + 1) ≈ 260k tokens per batch. The additional 1 comes from the surface-level variables. (The token-count sketch after this list spells out the arithmetic.)
  3. Yes, if the dataset has only three pressure levels, the model also predicts only three pressure levels. There should not be any pressure-level-specific weights. Instead, taking in and predicting pressure levels is done with a "positional encoding" that encodes the hPa of the pressure level, which allows the model to take in and predict any number of pressure levels (see the positional-encoding sketch after this list).
  4. In all cases, the source for the input and target are the same. Hence, in the old Figure 4b, the input and output are both ERA5.
  5. When fine-tuning Aurora to 0.1 degrees resolution, the input and target are both IFS analysis 0.1 degree.
  6. In the old Figure 3b, the input and target are both IFS analysis 0.1 degrees. The high-quality IFS analysis is indeed available only every 12 hours, but an analysis product with a slightly smaller assimilation window should be available at hours 06 and 18.
  7. Good catch! The denominator should be -log(1e-4) instead of log(1e-4). That should make the function monotonic. We'll fix this in the writing. Thank you. :)
  8. The embedding uses a different set of parameters per pressure level. This makes the patch embedding a little more expressive. This is done for both the old meteorological variables and the new air pollution variables.
  9. In the description, "the rest of the network" includes both the backbone and the decoder, so the learning rate should be 1e-4.
  10. You're right that this is not strictly necessary. We added this to help the model a little, since the air pollution variables show very strong diurnal behaviour, much more strongly than the meteorological variables seen during pretraining.
  11. Initialising the model to the identity mapping means that the model predicts the inputs at initialisation, which is usually called the persistence prediction and which should be a pretty good initialisation. Generally, residual models like this tend to be more stable and a little easier to train. (See the AdaptiveLayerNorm sketch after this list.)
  12. Concatenating the static variables also to the atmospheric variables means that the static variables can also influence the atmospheric token embeddings. While this is not strictly necessary, we made this change to deal with the increased difficulty of predicting atmospheric chemistry.
  13. Both the patches in the encoder and decoder are changed to shape 10x10, and these new embeddings are initialised by interpolating the original 4x4 patches learned during pretraining to 10x10 (see the patch-interpolation sketch after this list).
  14. Also a very good catch! There was an error in this table. Please see the revised version of the arXiv paper, where the error should be fixed.
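For concreteness, here is the token-count arithmetic from answer 2 as a minimal Python sketch (the grid shape, patch size, and latent-level count are the ones quoted above; this is just the arithmetic, not Aurora code):

```python
# Token count for one 0.25-degree batch: a 720 x 1440 grid is patch-encoded
# with 4x4 patches, and the pressure levels are aggregated into 3 "latent"
# levels, plus 1 extra level for the surface variables.
lat, lon = 720, 1440      # global 0.25-degree grid
patch = 4                 # patch size used during pretraining
latent_levels = 3         # latent pressure levels after aggregation
surface_levels = 1        # surface-level variables

tokens = (lat // patch) * (lon // patch) * (latent_levels + surface_levels)
print(tokens)  # 259200, i.e. roughly 260k tokens per batch
```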
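Answer 3 mentions a positional encoding of the pressure level's hPa value. The following is a minimal sketch of how such an encoding could work; the Fourier-feature construction, frequency range, and dimension are illustrative assumptions, not Aurora's actual implementation:

```python
import torch

def pressure_level_encoding(levels_hpa: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Sinusoidal (Fourier) features of pressure levels given in hPa.

    Because the encoding is a function of the hPa value itself, the same
    weights can take in and predict 3, 13, or any other number of levels.
    """
    half = dim // 2
    freqs = torch.exp(torch.linspace(0.0, -8.0, half))   # assumed frequency range
    angles = levels_hpa[:, None] * freqs[None, :]        # (levels, dim // 2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

enc_13 = pressure_level_encoding(torch.tensor([50., 100., 150., 200., 250., 300., 400.,
                                               500., 600., 700., 850., 925., 1000.]))
enc_3 = pressure_level_encoding(torch.tensor([500., 700., 850.]))
print(enc_13.shape, enc_3.shape)  # torch.Size([13, 64]) torch.Size([3, 64])
```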
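Answer 11 concerns the zero initialisation in model/film.py. Here is a minimal PyTorch sketch of that pattern; the attribute name ln_modulation follows the question, but the layer sizes and the exact forward pass are assumptions for illustration:

```python
import torch
import torch.nn as nn

class AdaptiveLayerNorm(nn.Module):
    """LayerNorm whose shift and scale are predicted from a conditioning vector.

    With ln_modulation initialised to zero, shift = scale = 0 at the start of
    training, so forward() reduces to a plain LayerNorm. Combined with residual
    connections, this lets the backbone start out close to an identity mapping,
    i.e. a persistence prediction.
    """

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim, elementwise_affine=False)
        self.ln_modulation = nn.Linear(cond_dim, 2 * dim)
        nn.init.zeros_(self.ln_modulation.weight)
        nn.init.zeros_(self.ln_modulation.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        shift, scale = self.ln_modulation(cond).chunk(2, dim=-1)
        return self.ln(x) * (1 + scale) + shift

# At initialisation the module behaves exactly like an unconditioned LayerNorm:
x, cond = torch.randn(2, 16, 128), torch.randn(2, 1, 32)
aln = AdaptiveLayerNorm(128, 32)
print(torch.allclose(aln(x, cond), nn.LayerNorm(128, elementwise_affine=False)(x)))  # True
```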
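Answer 13 says the 4x4 patch embeddings are interpolated to 10x10. A hedged sketch of what that could look like for a conv-style patch embedding; the channel counts and the area-based rescaling are assumptions, not the actual fine-tuning code:

```python
import torch
import torch.nn.functional as F

in_ch, embed_dim = 4, 512                 # assumed channel counts
w4 = torch.randn(embed_dim, in_ch, 4, 4)  # pretrained 4x4 patch-embedding weights

# Bilinearly resize the spatial kernel from 4x4 to 10x10, then rescale by the
# area ratio so the response to a constant input stays roughly unchanged.
w10 = F.interpolate(w4, size=(10, 10), mode="bilinear", align_corners=False)
w10 = w10 * (4 * 4) / (10 * 10)

print(w10.shape)  # torch.Size([512, 4, 10, 10])
```

Per answer 13, the decoder's patches are resized in the same way, so the model works with 10x10 patches directly rather than interpolating between patch sizes at runtime.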

I hope this answers all your questions. Please let me know if anything remains unclear. :)
