I noticed that after the transformer layer, AdaBins directly uses the first output token for the bin-width predictions and the following 128 tokens as queries. If I am not mistaken, these 129 tokens correspond to actual locations in the image (since the tokens are flattened image features). I was wondering why you chose to have such an overlap, instead of, say, using 129 specially defined tokens.
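For context, this is roughly the token split I am referring to. This is only a minimal sketch, not the repository's exact code; the module name, the `bin_head` MLP, and the dimensions are my assumptions:

```python
import torch
import torch.nn as nn

class BinQuerySplit(nn.Module):
    """Sketch (not the authors' exact code): split the transformer output into
    one 'global' token for bin-width regression and the next n_query_channels
    tokens used as per-pixel attention queries."""

    def __init__(self, embed_dim=128, n_bins=256, n_query_channels=128):
        super().__init__()
        self.n_query_channels = n_query_channels
        # Hypothetical MLP head that regresses the adaptive bin widths
        self.bin_head = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.LeakyReLU(),
            nn.Linear(256, n_bins), nn.ReLU()
        )

    def forward(self, tokens):
        # tokens: (S, N, E) = (sequence length, batch, embedding dim),
        # produced by flattening image patches and running a transformer
        first = tokens[0]                              # (N, E) -> bin widths
        queries = tokens[1:self.n_query_channels + 1]  # (n_query_channels, N, E)
        bin_widths = self.bin_head(first)
        bin_widths = bin_widths / bin_widths.sum(dim=1, keepdim=True)  # normalize
        return bin_widths, queries
```

So the first 129 output tokens do double duty: they sit at patch positions but are read out as the bin/query predictions.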
Thanks for your interest in our work. This is an interesting question.
Since the self-attention layers combine information globally (from all tokens to all tokens), there is actually no natural correspondence between the input tokens and the output tokens of the transformer. Technically, after the first layer you can interpret any token however you like; the interpretation is effectively imposed by the loss and by how the token is used downstream. (Note: the first layer's output does retain a weak, query-based correspondence, since each "query" is predicted from its own input token.)
If you have a reason to distinguish between patch tokens and one token that represents global information, you can simply add a dummy [CLS] token and use that as the "global" token instead. This is the case, for example, in language models, where you really do need a direct correspondence between input and output tokens. In our case, however, we don't need any such correspondence, so the transformer is free to pool any kind of information into any of the tokens, without discrimination or any bias towards the corresponding patches.
That said, adding 129 extra dummy input tokens might actually improve performance slightly, purely because of the longer sequence length and the extra representational capacity, but it would also increase the memory footprint.
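For illustration only, a dummy-token variant could look something like the sketch below. This is hypothetical code, not something implemented in the repository, and the layer counts and dimensions are made up:

```python
import torch
import torch.nn as nn

class DummyTokenVariant(nn.Module):
    """Hypothetical variant: prepend 1 + n_query_channels learned dummy tokens,
    so the bin/query read-outs no longer overlap with patch positions.
    The sequence is longer, so compute and memory use grow accordingly."""

    def __init__(self, embed_dim=128, n_query_channels=128, n_layers=4, n_heads=4):
        super().__init__()
        self.n_query_channels = n_query_channels
        # Learned extra tokens, analogous to a [CLS] token: (129, 1, E)
        self.extra_tokens = nn.Parameter(
            torch.randn(1 + n_query_channels, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, patch_tokens):
        # patch_tokens: (S, N, E) flattened image features
        n = patch_tokens.shape[1]
        extra = self.extra_tokens.expand(-1, n, -1)       # broadcast over batch
        out = self.encoder(torch.cat([extra, patch_tokens], dim=0))
        bin_token = out[0]                                 # dedicated global token
        queries = out[1:1 + self.n_query_channels]         # dedicated query tokens
        return bin_token, queries
```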
Hope that gives you some insight into the design choice in question.