diff --git a/docs/source/faq.rst b/docs/source/faq.rst
index b5157f5..943140d 100644
--- a/docs/source/faq.rst
+++ b/docs/source/faq.rst
@@ -132,7 +132,7 @@ Related work
 The main advantages of Modula over Tensor Programs are that:
- 1. **Modula is grounded in elementary math.** We show that learning rate transfer is essentially just the question of how to build neural nets with tight and non-dimensional Lipschitz estimates. The main ingredient is just bounding derivatives and tracking how derivative bounds behave under composition and concatentation. We do not employ limiting or probabilistic analyses.
+ 1. **Modula is grounded in elementary math.** We show that learning rate transfer is essentially just the question of how to build neural nets with tight and non-dimensional Lipschitz estimates. The main ingredient is just bounding derivatives and tracking how derivative bounds behave under composition and concatenation. We do not employ limiting or probabilistic analyses.
 2. **Modula theory is non-asymptotic.** The unifying thread through the Tensor Programs series of works is the study of neural network computation in limiting cases: infinite width, infinite depth, and so on. This means that the theory is encumbered by significant mathematical overhead, and one is often confronted with thorny technical questions---for example: `do width and depth limits commute? `_ In contrast, Modula is based on a completely non-asymptotic theory. It deals directly with the finite-sized neural networks that we actually use in practice, so you don't have to worry that certain technical details may be "lost in the limit". To show that this is not just talk, in our paper we `built a theory of an actual working transformer `_.
 3. **Modula is more automatic.** In Modula, we automatically build a norm during construction of the computation graph that can be used to explicitly normalize weight updates taken from any base optimizer. The Tensor Programs approach essentially amounts to manually deriving a priori estimates on the size of this norm, and using these estimates to modify the SGD learning rate per layer. However, working out these prior estimates is quite a hairy procedure which seemingly does not always work, hence why later Tensor Programs papers `shift to modifying Adam updates `_. Adam updates are easier to deal with since they already impose a form of normalization on the gradients. Furthermore, the Tensor Programs calculations must be done by hand. The result is large tables of scaling rules, with tables of rules for different base optimizers (Adam versus SGD) and even tables for different matrix shapes (square versus wide rectangular versus skinny rectangular).
@@ -182,4 +182,4 @@ Research philosophy
 .. dropdown:: Do I need to be a mathematical savant to contribute to research of this kind?
 :icon: question
- I don't think so. There are a lot of very technical people working in this field bringing with them some quite advanced tools from math and theoretical physics, and this is great. But in my experience it's usually the simpler and more elementary ideas that actually work in practice. I strongly believe that deep learning theory is still at the stage of model building. And I resonate with both Rahimi and Recht's call for `"simple theorems" and "simple experiments" `_ and George Dahl's call for `a healthy dose of skepticism `_ when evaluating claims in the literature.
\ No newline at end of file
+ I don't think so. There are a lot of very technical people working in this field bringing with them some quite advanced tools from math and theoretical physics, and this is great. But in my experience it's usually the simpler and more elementary ideas that actually work in practice. I strongly believe that deep learning theory is still at the stage of model building. And I resonate with both Rahimi and Recht's call for `simple theorems and simple experiments `_ and George Dahl's call for `a healthy dose of skepticism `_ when evaluating claims in the literature.
\ No newline at end of file