Tokens used for regression & queries #69

Open
Divadi opened this issue Aug 10, 2022 · 1 comment

Comments


Divadi commented Aug 10, 2022

I noticed that after the transformer layer, AdaBins directly uses the first token for the bins predictions and the following 128 tokens for queries. If I am not mistaken, these 129 tokens correspond to actual locations in the image (since the tokens are flattened image features). I was wondering why you chose to have such an overlap, instead of perhaps using 129 specially-defined tokens.
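For reference, here is a rough sketch of the split as I understand it (the class name, variable names, and shapes below are my own illustration, not the repository's exact code): the first output token goes through an MLP to predict the bin widths, and the next 128 output tokens are dotted with the decoder features to form per-pixel attention maps.

```python
import torch
import torch.nn as nn

class TokenSplitHead(nn.Module):
    """Illustrative sketch: reuse transformer output tokens as bin token + queries."""
    def __init__(self, embed_dim=128, n_query_channels=128, n_bins=256):
        super().__init__()
        self.n_query_channels = n_query_channels
        # MLP that turns the first output token into (unnormalized) bin widths
        self.bin_regressor = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.LeakyReLU(),
            nn.Linear(256, n_bins),
        )

    def forward(self, tokens, decoder_features):
        # tokens: (S, B, E) transformer output over the flattened image patches
        # decoder_features: (B, E, H, W) dense features from the decoder
        bin_token = tokens[0]                              # token 0 -> bin widths
        queries = tokens[1:1 + self.n_query_channels]      # tokens 1..128 -> queries
        bin_widths = torch.softmax(self.bin_regressor(bin_token), dim=1)

        # Dot product between each query and every pixel's decoder feature
        b, e, h, w = decoder_features.shape
        flat = decoder_features.view(b, e, h * w)            # (B, E, HW)
        attn = torch.einsum('qbe,beh->bqh', queries, flat)   # (B, 128, HW)
        return bin_widths, attn.view(b, self.n_query_channels, h, w)
```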

@shariqfarooq123
Owner

Hi Divadi,

Thanks for your interest in our work. This is an interesting question.

Since the self-attention layers combine information globally (from all tokens to all tokens), there is actually no natural correspondence between the input tokens and the output tokens of the transformer. Technically, you can interpret any token after the first layer however you like; the interpretation is imposed by the loss or by how the token is used downstream. (Note: the first layer's output does have a weak "query"-based correspondence, since the "queries" are predicted from the corresponding input tokens.)

If you have a reason to distinguish between patch tokens and one token that represents global information, you can simply add a dummy [CLS] token and use that as the "global" token instead. This is the case, for example, in language models, where you really do need a direct correspondence between input and output tokens. In our case, however, we don't need any such correspondence, so the transformer is free to pool any kind of information into any of the tokens, without discrimination or any bias towards the corresponding patches.

With that said, adding 129 extra dummy input tokens might actually increase performance slightly, purely because of the longer sequence length and the added representational capacity, but it would also increase the memory footprint.
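To make that concrete, here is a minimal sketch of what the dummy-token variant could look like (the names and dimensions are just for illustration, not the actual AdaBins code): 129 learnable tokens are prepended to the patch tokens, and the bin/query outputs are read off those instead of the patch positions.

```python
import torch
import torch.nn as nn

class DummyTokenEncoder(nn.Module):
    """Illustrative sketch: dedicated learnable tokens instead of reused patch tokens."""
    def __init__(self, embed_dim=128, n_dummy=129, n_heads=4, n_layers=4):
        super().__init__()
        # 129 learnable tokens that are not tied to any image location
        self.dummy_tokens = nn.Parameter(torch.randn(n_dummy, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, patch_tokens):
        # patch_tokens: (S, B, E) flattened image-patch embeddings
        b = patch_tokens.shape[1]
        dummies = self.dummy_tokens.expand(-1, b, -1)      # (129, B, E)
        x = torch.cat([dummies, patch_tokens], dim=0)      # longer sequence -> more memory
        out = self.encoder(x)
        bin_token = out[0]           # plays the role of a [CLS]-style "global" token
        query_tokens = out[1:129]    # 128 dedicated query tokens
        return bin_token, query_tokens
```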

Hope that gives you some insight into the design choice in question.
