Added an example for a Vision Transformer (ViT) #483

ahmed-alllam · 2023-09-08T09:31:53Z

This PR adds a practical example for a Vision Transformer (ViT) in Equinox, based on the paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

ASEM000 · 2023-09-08T09:44:54Z

Cool work @ahmed-alllam , AFAIK, you can use keras_core with jax backend to fetch mnist. The benefit is that you don't need torch - which is a bit heavy package - . for ref on this point, check here

patrick-kidger · 2023-09-08T09:48:32Z

Thanks for the contribution! Some quick comments:

I actually prefer the use of PyTorch to Keras here. Mainly because we're already using it in various spots, e.g. for DataLoaders, so better to be consistent.
I think MNIST is probably too simple a benchmark to really be worth applying a ViT to. Could you choose something more sophisticated?
Can you add a reference to eqxvision's implementation of a ViT? It'd be worth pointing people towards a featureful implementation.

ahmed-alllam · 2023-09-11T12:53:35Z

Hi @patrick-kidger! Any updates on this PR?

patrick-kidger · 2023-09-11T17:50:00Z

Looking over this now. GitHub doesn't yet allow us to leave comments on .ipynb files inline, so feedback as follows. All of these comments are pretty nitty, as overall this looks very clean!

capitalise "equinox" -> "Equinox"
use just list instead of typing.List. As of Python 3.9, the latter is now deprecated.
use from jaxtyping import PRNGKeyArray to annotate random keys, rather than that jr.PRNGKey. (The latter is a function, not a type.)
Can you add a comment to the GitHub repo for einops and optax, where they're imported? (C.f. the other examples.)
Can you add jaxtyping shape annotations to the various __call__ methods?
The annotations for AttentionBlock.linear{1,2} should be Linear, not Sequential.
prefer to do e.g. key1, key2, key3 = jr.split(key, 3), rather than indexing like keys[0] Preferring unpacking over indexing is general best practice in Python.
In VisionTransformer.__call__, can you do one big split, rather than multiple little splits? One big split is much more efficient at runtime.
You're using a mix of jax.Array and jnp.ndarray in annotations. Can you pick just one?

ahmed-alllam · 2023-09-17T13:49:56Z

@patrick-kidger Addressed your feedback and made the necessary changes. Please review and merge if everything looks good. Thanks!

patrick-kidger · 2023-09-22T20:28:58Z

I think the positional_embedding looks wrong -- you've defined a whole matrix and then only ever index into it at a single static value.

I think the x = x[0] line could do with a comment attached. The first time you see it, it's a bit surprising that only a single patch gets passed to the MLP.

Other than that, I think this looks very tidily done.

ahmed-alllam · 2023-09-22T21:15:17Z

Thank you for pointing that out!

You're right about the positional_embedding. It was a small oversight on my part. The intention was indeed to slice the positional_embedding array to fit the number of patches in the image. I've corrected it to:

x += self.positional_embedding[: x.shape[0]]    # Slice to the same length as x, as the positional embedding may be longer.

I've also added a comment to clarify the x = x[0] line:

x = x[0]    # Select the CLS token.

Also, do you have any insights into why the checks are failing? They passed successfully for all previous commits, and this was just a minor change. I've gone through the logs, and it appears there might be a dependency issue stemming from PyRight.

patrick-kidger · 2023-09-22T22:22:29Z

Try rebasing against dev. (And making that the target branch.) This was a change in JAX; now fixed in both dev and in the JAX release (jax-ml/jax#17684).

… and readability.

patrick-kidger · 2023-09-23T18:12:54Z

Alright, LGTM! Thank you for the example. This will appear in the docs for the next release of Equinox. (Once dev -> main.)

* Added an example for a vision transformer (vit) * Changed dataset to CIFAR10, added reference to eqxvision's ViT module * Refactored the Vision Transformer example for improved code structure and readability. * Fixed a small issue in positional embeddings

ahmed-alllam added 4 commits September 23, 2023 02:48

Added an example for a vision transformer (vit)

19b7e1c

Changed dataset to CIFAR10, added reference to eqxvision's ViT module

fdb643d

Refactored the Vision Transformer example for improved code structure…

8a7593e

… and readability.

Fixed a small issue in positional embeddings

427855d

ahmed-alllam force-pushed the main branch from 8c7d75f to 427855d Compare September 22, 2023 23:50

ahmed-alllam changed the base branch from main to dev September 22, 2023 23:51

patrick-kidger merged commit d9b018a into patrick-kidger:dev Sep 23, 2023
2 checks passed

patrick-kidger mentioned this pull request Sep 27, 2023

Tweaked ViT example #521

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added an example for a Vision Transformer (ViT) #483

Added an example for a Vision Transformer (ViT) #483

ahmed-alllam commented Sep 8, 2023

ASEM000 commented Sep 8, 2023

patrick-kidger commented Sep 8, 2023 •

edited

Loading

ahmed-alllam commented Sep 11, 2023

patrick-kidger commented Sep 11, 2023

ahmed-alllam commented Sep 17, 2023

patrick-kidger commented Sep 22, 2023 •

edited

Loading

ahmed-alllam commented Sep 22, 2023 •

edited

Loading

patrick-kidger commented Sep 22, 2023

patrick-kidger commented Sep 23, 2023

Added an example for a Vision Transformer (ViT) #483

Added an example for a Vision Transformer (ViT) #483

Conversation

ahmed-alllam commented Sep 8, 2023

ASEM000 commented Sep 8, 2023

patrick-kidger commented Sep 8, 2023 • edited Loading

ahmed-alllam commented Sep 11, 2023

patrick-kidger commented Sep 11, 2023

ahmed-alllam commented Sep 17, 2023

patrick-kidger commented Sep 22, 2023 • edited Loading

ahmed-alllam commented Sep 22, 2023 • edited Loading

patrick-kidger commented Sep 22, 2023

patrick-kidger commented Sep 23, 2023

patrick-kidger commented Sep 8, 2023 •

edited

Loading

patrick-kidger commented Sep 22, 2023 •

edited

Loading

ahmed-alllam commented Sep 22, 2023 •

edited

Loading