Added an example for a Vision Transformer (ViT) #483
Conversation
Cool work @ahmed-alllam! AFAIK, you can use …
Thanks for the contribution! Some quick comments: …
Hi @patrick-kidger! Any updates on this PR?
Looking over this now. GitHub doesn't yet allow us to leave comments on …
@patrick-kidger Addressed your feedback and made the necessary changes. Please review and merge if everything looks good. Thanks!
I think the positional embedding should be sliced to the length of `x`, and I think the `x = x[0]` line could use a comment. Other than that, I think this looks very tidily done.
Thank you for pointing that out! You're right about the positional embedding; it is now `x += self.positional_embedding[: x.shape[0]]  # Slice to the same length as x, as the positional embedding may be longer.` I've also added a comment to clarify the CLS token selection: `x = x[0]  # Select the CLS token.`

Also, do you have any insights into why the checks are failing? They passed successfully for all previous commits, and this was just a minor change. I've gone through the logs, and it appears there might be a dependency issue stemming from Pyright.
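For context, here is a minimal sketch of how the two lines under discussion typically fit together in an Equinox module. This is an illustration under stated assumptions, not the code from this PR: the class name `PatchEmbeddingWithCLS`, its field names, and the shapes are all hypothetical.

```python
# Hypothetical sketch of the two lines discussed above; names and shapes are
# illustrative, not taken from the PR's actual ViT implementation.
import equinox as eqx
import jax
import jax.numpy as jnp


class PatchEmbeddingWithCLS(eqx.Module):
    cls_token: jax.Array
    positional_embedding: jax.Array

    def __init__(self, max_seq_len: int, embed_dim: int, *, key: jax.Array):
        ckey, pkey = jax.random.split(key)
        self.cls_token = jax.random.normal(ckey, (1, embed_dim))
        # May be longer than any particular input sequence.
        self.positional_embedding = jax.random.normal(pkey, (max_seq_len, embed_dim))

    def __call__(self, patches: jax.Array) -> jax.Array:
        # patches has shape (num_patches, embed_dim); prepend the CLS token.
        x = jnp.concatenate([self.cls_token, patches], axis=0)
        # Slice to the same length as x, as the positional embedding may be longer.
        x = x + self.positional_embedding[: x.shape[0]]
        return x
```

After the transformer blocks, `x = x[0]` then selects the CLS token's embedding as the input to the classification head.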
Try rebasing against …
Alright, LGTM! Thank you for the example. This will appear in the docs for the next release of Equinox. (Once …)
* Added an example for a vision transformer (vit)
* Changed dataset to CIFAR10, added reference to eqxvision's ViT module
* Refactored the Vision Transformer example for improved code structure and readability.
* Fixed a small issue in positional embeddings
This PR adds a practical example for a Vision Transformer (ViT) in Equinox, based on the paper *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*.
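To illustrate the paper's core idea, here is a minimal, self-contained sketch of the patchification step. The function name `image_to_patches` and the shape conventions are assumptions for illustration, not code from the PR:

```python
import jax.numpy as jnp


def image_to_patches(image: jnp.ndarray, patch_size: int = 16) -> jnp.ndarray:
    """Cut an (H, W, C) image into non-overlapping patches, each flattened
    into a vector -- the "16x16 words" that the transformer attends over."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    x = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (h_patches, w_patches, patch, patch, c)
    return x.reshape(-1, patch_size * patch_size * c)


# Example: a 32x32 RGB CIFAR-10 image with 16x16 patches yields 4 patches.
patches = image_to_patches(jnp.zeros((32, 32, 3)))
print(patches.shape)  # (4, 768)
```

In the ViT itself, these flattened patches are then linearly projected to the embedding dimension before the CLS token and positional embeddings are added.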