
Commit

some tweaks
jxbz committed Jun 5, 2024
1 parent 932b886 commit 46ca48f
Showing 3 changed files with 10 additions and 5 deletions.
2 changes: 1 addition & 1 deletion docs/source/bad-scaling.rst
@@ -27,6 +27,6 @@ The good news is that we have developed machinery that largely solves these scal
weights -= learning_rate * normalize(gradient)
-This initialization and gradient normalization *removes drift* in the optimal learning rate, and causes performance to *improve* with increasing scale. Modula automatically infers the necessary initialize and normalize functions from the architecture of the network. So the user can focus on writing their neural network architecture while Modula will handle properly normalizing the training.
+This initialization and gradient normalization removes drift in the optimal learning rate, and causes performance to improve with increasing scale. Modula automatically infers the necessary initialize and normalize functions from the architecture of the network. So the user can focus on writing their neural network architecture while Modula will handle properly normalizing the training.

These docs are intended to explain how Modula works and also introduce the Modula API. In case you don't care about Modula or automatic gradient normalization, the next section will explain how you can normalize training manually in a different framework like `PyTorch <https://pytorch.org>`_ or `JAX <https://github.com/google/jax>`_.
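
To make the normalized update above concrete, here is a minimal sketch of doing it manually in PyTorch. The per-tensor normalization by the gradient's Frobenius norm, the ``eps`` constant, and the toy model are illustrative assumptions, not Modula's actual normalization rule::

    import torch

    def normalized_update(params, learning_rate, eps=1e-8):
        # Apply weights -= learning_rate * normalize(gradient), one tensor at a time.
        with torch.no_grad():
            for p in params:
                if p.grad is None:
                    continue
                p -= learning_rate * p.grad / (p.grad.norm() + eps)

    # Usage sketch: one step on a tiny linear model with random data.
    model = torch.nn.Linear(16, 4)
    x, y = torch.randn(32, 16), torch.randn(32, 4)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    normalized_update(model.parameters(), learning_rate=0.1)
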
4 changes: 2 additions & 2 deletions docs/source/golden-rules.rst
@@ -124,9 +124,9 @@ To spell this out more clearly, the dot product is the sum of a number :python:`
Wrapping up
^^^^^^^^^^^^

-On this page, we introduced three "golden rules" for scaling and pointed out how they differ to some conventional wisdom about controlling activation variance at initialization. One of the points we hope to get across is that the logical reasoning associated with the golden rules is not only *more scalable* but also *simpler* than standard approaches based on controlling variance. You don't need to know anything about how random variables behave in order to get scaling right---you just need to know how objects add when they point in the same direction. Furthermore, the use of orthogonal initialization obviates the need to know anything about the spectral properties of Gaussian random matrices.
+On this page, we introduced three "golden rules" for scaling and pointed out how they differ to some conventional wisdom about controlling activation variance at initialization. Something we hope to get across is that the logic associated with the golden rules is not only *more scalable* than standard approaches based on controlling variance, but also *simpler*. You don't need to know anything about how random variables behave in order to get the scaling right---you just need to know how objects add when they point in the same direction. And in the same vein, the use of orthogonal initialization obviates the need to know anything about the spectral properties of Gaussian random matrices.

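The point about how objects add when they point in the same direction can be illustrated numerically. The following is a toy check (not from the docs): summing n copies of a unit vector gives norm n, while summing n independent random unit vectors gives norm close to sqrt(n)::

    import torch

    n, d = 10_000, 512
    v = torch.randn(d)
    v = v / v.norm()                              # a single unit vector

    aligned = n * v                               # n copies pointing the same way
    dirs = torch.randn(n, d)
    dirs = dirs / dirs.norm(dim=1, keepdim=True)  # n independent unit vectors
    rand_sum = dirs.sum(dim=0)

    print(aligned.norm())    # exactly n = 10000
    print(rand_sum.norm())   # roughly sqrt(n) = 100
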
-In the next section we will look at the history behind these ideas, and after that we will explain how Modula automates the application of the golden rules.
+In the next section we will look at the history behind these ideas. After that we will move on to explaining how Modula automates the application of the golden rules.


.. [#outerproduct] The mathematical analogue of this intuitive statement is to say that the gradient of a linear layer is an outer product of the layer input with the gradient of the loss with respect to the layer output.
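
As a quick check of this footnote, the following toy example (with an illustrative sum-of-outputs loss, not taken from the docs) confirms that autograd's weight gradient for a linear map matches the outer product of the output gradient with the layer input::

    import torch

    W = torch.randn(4, 3, requires_grad=True)  # weight of a linear layer
    x = torch.randn(3)                         # layer input
    y = W @ x                                  # layer output
    loss = y.sum()                             # toy loss, chosen for illustration
    loss.backward()

    grad_out = torch.ones(4)                   # dloss/dy for this toy loss
    print(torch.allclose(W.grad, torch.outer(grad_out, x)))  # True
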
9 changes: 7 additions & 2 deletions docs/source/history.rst
@@ -13,6 +13,9 @@ The science of scale

some twists and turns

Pre-history
^^^^^^^^^^^^

| 📘 `On the distance between two neural networks and the stability of learning <https://arxiv.org/abs/2002.03432>`_
| Jeremy Bernstein, Arash Vahdat, Yisong Yue, Ming-Yu Liu
| NeurIPS 2020
@@ -23,13 +26,15 @@ some more text
| Greg Yang, Edward J. Hu
| ICML 2021
and more text
Truth and reconciliation
^^^^^^^^^^^^^^^^^^^^^^^^^

| 📗 `A spectral condition for feature learning <https://arxiv.org/abs/2310.17813>`_
| Greg Yang, James B. Simon, Jeremy Bernstein
| arXiv 2023
and more
Automation of training
^^^^^^^^^^^^^^^^^^^^^^^

| 📒 `Automatic gradient descent: Deep learning without hyperparameters <https://arxiv.org/abs/2304.05187>`_
| Jeremy Bernstein, Chris Mingard, Kevin Huang, Navid Azizan, Yisong Yue
