
Commit

some tweaks
jxbz committed Jun 5, 2024
1 parent 932b886 commit 46ca48f
Showing 3 changed files with 10 additions and 5 deletions.
2 changes: 1 addition & 1 deletion docs/source/bad-scaling.rst
@@ -27,6 +27,6 @@ The good news is that we have developed machinery that largely solves these scal
weights -= learning_rate * normalize(gradient)
-This initialization and gradient normalization *removes drift* in the optimal learning rate, and causes performance to *improve* with increasing scale. Modula automatically infers the necessary initialize and normalize functions from the architecture of the network. So the user can focus on writing their neural network architecture while Modula will handle properly normalizing the training.
+This initialization and gradient normalization removes drift in the optimal learning rate, and causes performance to improve with increasing scale. Modula automatically infers the necessary initialize and normalize functions from the architecture of the network. So the user can focus on writing their neural network architecture while Modula will handle properly normalizing the training.

These docs are intended to explain how Modula works and also introduce the Modula API. In case you don't care about Modula or automatic gradient normalization, the next section will explain how you can normalize training manually in a different framework like `PyTorch <https://pytorch.org>`_ or `JAX <https://github.com/google/jax>`_.
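
To make the normalized update above concrete, here is a minimal sketch of doing it manually in PyTorch. The per-tensor normalization by the gradient's Frobenius norm, the ``eps`` constant, and the toy model are illustrative assumptions, not Modula's actual normalization rule::

    import torch

    def normalized_update(params, learning_rate, eps=1e-8):
        # Apply weights -= learning_rate * normalize(gradient), one tensor at a time.
        with torch.no_grad():
            for p in params:
                if p.grad is None:
                    continue
                p -= learning_rate * p.grad / (p.grad.norm() + eps)

    # Usage sketch: one step on a tiny linear model with random data.
    model = torch.nn.Linear(16, 4)
    x, y = torch.randn(32, 16), torch.randn(32, 4)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    normalized_update(model.parameters(), learning_rate=0.1)
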
4 changes: 2 additions & 2 deletions docs/source/golden-rules.rst
@@ -124,9 +124,9 @@ To spell this out more clearly, the dot product is the sum of a number :python:`
Wrapping up
^^^^^^^^^^^^

-On this page, we introduced three "golden rules" for scaling and pointed out how they differ to some conventional wisdom about controlling activation variance at initialization. One of the points we hope to get across is that the logical reasoning associated with the golden rules is not only *more scalable* but also *simpler* than standard approaches based on controlling variance. You don't need to know anything about how random variables behave in order to get scaling right---you just need to know how objects add when they point in the same direction. Furthermore, the use of orthogonal initialization obviates the need to know anything about the spectral properties of Gaussian random matrices.
+On this page, we introduced three "golden rules" for scaling and pointed out how they differ to some conventional wisdom about controlling activation variance at initialization. Something we hope to get across is that the logic associated with the golden rules is not only *more scalable* than standard approaches based on controlling variance, but also *simpler*. You don't need to know anything about how random variables behave in order to get the scaling right---you just need to know how objects add when they point in the same direction. And in the same vein, the use of orthogonal initialization obviates the need to know anything about the spectral properties of Gaussian random matrices.

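The point about how objects add when they point in the same direction can be illustrated numerically. The following is a toy check (not from the docs): summing n copies of a unit vector gives norm n, while summing n independent random unit vectors gives norm close to sqrt(n)::

    import torch

    n, d = 10_000, 512
    v = torch.randn(d)
    v = v / v.norm()                              # a single unit vector

    aligned = n * v                               # n copies pointing the same way
    dirs = torch.randn(n, d)
    dirs = dirs / dirs.norm(dim=1, keepdim=True)  # n independent unit vectors
    rand_sum = dirs.sum(dim=0)

    print(aligned.norm())    # exactly n = 10000
    print(rand_sum.norm())   # roughly sqrt(n) = 100
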
-In the next section we will look at the history behind these ideas, and after that we will explain how Modula automates the application of the golden rules.
+In the next section we will look at the history behind these ideas. After that we will move on to explaining how Modula automates the application of the golden rules.


.. [#outerproduct] The mathematical analogue of this intuitive statement is to say that the gradient of a linear layer is an outer product of the layer input with the gradient of the loss with respect to the layer output.
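
As a quick check of this footnote, the following toy example (with an illustrative sum-of-outputs loss, not taken from the docs) confirms that autograd's weight gradient for a linear map matches the outer product of the output gradient with the layer input::

    import torch

    W = torch.randn(4, 3, requires_grad=True)  # weight of a linear layer
    x = torch.randn(3)                         # layer input
    y = W @ x                                  # layer output
    loss = y.sum()                             # toy loss, chosen for illustration
    loss.backward()

    grad_out = torch.ones(4)                   # dloss/dy for this toy loss
    print(torch.allclose(W.grad, torch.outer(grad_out, x)))  # True
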
9 changes: 7 additions & 2 deletions docs/source/history.rst
@@ -13,6 +13,9 @@ The science of scale

some twists and turns

Pre-history
^^^^^^^^^^^^

| 📘 `On the distance between two neural networks and the stability of learning <https://arxiv.org/abs/2002.03432>`_
| Jeremy Bernstein, Arash Vahdat, Yisong Yue, Ming-Yu Liu
| NeurIPS 2020
@@ -23,13 +26,15 @@ some more text
| Greg Yang, Edward J. Hu
| ICML 2021
and more text
Truth and reconciliation
^^^^^^^^^^^^^^^^^^^^^^^^^

| 📗 `A spectral condition for feature learning <https://arxiv.org/abs/2310.17813>`_
| Greg Yang, James B. Simon, Jeremy Bernstein
| arXiv 2023
and more
Automation of training
^^^^^^^^^^^^^^^^^^^^^^^

| 📒 `Automatic gradient descent: Deep learning without hyperparameters <https://arxiv.org/abs/2304.05187>`_
| Jeremy Bernstein, Chris Mingard, Kevin Huang, Navid Azizan, Yisong Yue
