Thank you for your excellent work. We have a question regarding the attention visualization in Figure 2 of your paper.
We attempted to reproduce the visualization using the following approach:
- Taking the last transformer layer
- Summing across all attention heads
A minimal sketch of our method is shown below. However, our results differ significantly from yours: even when we restrict the code to the first head of the first transformer layer, we cannot obtain attention scores as high, or a distribution pattern as pronounced, as in your figure.
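Roughly, our attempt looks like the sketch below. It assumes a Hugging Face-style model loaded with `output_attentions=True`; the checkpoint name, input sentence, and plotting details are placeholders, not the exact setup from your paper.

```python
# Minimal sketch of our attempt (checkpoint and input are placeholders).
import torch
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

text = "an example input sentence"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each (batch, num_heads, seq_len, seq_len)
last_layer = outputs.attentions[-1]   # take the last transformer layer
attn = last_layer.sum(dim=1)[0]       # sum across all attention heads

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
plt.imshow(attn.numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar()
plt.show()
```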
Could you help us understand whether there are any specific preprocessing or normalization steps we might be missing?
To help us better understand and reproduce your results, would it be possible to share the visualization code you used? This would be incredibly helpful for our research.
Thank you for your time and assistance.