Commit

add faq
jxbz committed Jun 8, 2024
1 parent 46ca48f commit 448e086
Showing 3 changed files with 40 additions and 7 deletions.
13 changes: 8 additions & 5 deletions docs/source/conf.py
@@ -11,14 +11,17 @@
"sphinx_inline_tabs",
"sphinx.ext.autodoc",
"matplotlib.sphinxext.plot_directive",
"sphinxext.opengraph"
"sphinxext.opengraph",
"sphinx_design"
]
templates_path = ['_templates']
exclude_patterns = []
-rst_prolog = """.. role:: python(code)
-   :language: python
-   :class: highlight
-"""
+rst_prolog = """
+.. |nbsp| unicode:: U+00A0 .. NO-BREAK SPACE
+.. role:: python(code)
+   :language: python
+   :class: highlight
+"""

# -- Opengraph ---------------------------------------------------------------
ogp_site_url = "https://jeremybernste.in/modula/"
24 changes: 24 additions & 0 deletions docs/source/faq.rst
@@ -0,0 +1,24 @@
Frequently asked questions
===========================

.. |question| replace:: :octicon:`question;1em;sd-text-success;`

Feel free to reach out or start a `GitHub issue <https://github.com/jxbz/modula/issues>`_ if you have any questions about Modula. We'll post answers to any useful or common questions on this page.

|question| |nbsp| Why does modular normalization lead to learning rate transfer across scale?

In simple terms, when weight updates :math:`\Delta \mathbf{w}` are normalized in the modular norm :math:`\|\cdot\|_\mathsf{M}` of the module :math:`\mathsf{M}`, the resulting updates :math:`\Delta \mathbf{y}` to the module output are well-behaved in the output norm :math:`\|\cdot\|_\mathcal{Y}`, independent of the scale of the architecture. A little more formally:

1. modules are one-Lipschitz in the modular norm, meaning that :math:`\|\Delta \mathbf{y}\|_\mathcal{Y} \leq \|\Delta \mathbf{w}\|_\mathsf{M}`;
2. this inequality holds tightly when tensors in the network align during training, meaning that :math:`\|\Delta \mathbf{y}\|_\mathcal{Y} \approx \|\Delta \mathbf{w}\|_\mathsf{M}` in a fully aligned network;
3. therefore, normalizing updates in the modular norm provides control over the change in outputs, independent of the size of the architecture.

Since modular normalization works by recursively normalizing the weight updates to each submodule, these desirable properties in fact extend to all submodules as well as the overall compound.
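
To make this concrete, here is a rough sketch in plain NumPy (illustrative only, not Modula's own API) of the simplest case: a single linear layer measured in the spectral norm. Normalizing the raw update to unit spectral norm bounds the change in outputs by the learning rate at every width.

.. code-block:: python

    # Illustrative sketch: a single linear layer under the spectral norm.
    # Normalizing the raw update to unit spectral norm bounds the change in
    # outputs by the learning rate, independent of the width. Modular
    # normalization applies this idea recursively to every submodule.
    import numpy as np

    rng = np.random.default_rng(0)
    lr = 0.1

    for width in [64, 256, 1024]:
        x = rng.standard_normal(width)
        x /= np.linalg.norm(x)                            # unit-norm input

        raw_update = rng.standard_normal((width, width))  # stand-in for a raw gradient
        delta_w = lr * raw_update / np.linalg.norm(raw_update, ord=2)

        delta_y = delta_w @ x
        print(width, np.linalg.norm(delta_y))             # at most lr = 0.1 at every width

At every width the printed norm stays at or below the learning rate, which is exactly the scale-independent control on outputs described above.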

|question| |nbsp| Is it necessary to use orthogonal initialization in Modula?

No. You could rewrite the atomic modules to use Gaussian initialization if you wanted. We chose orthogonal initialization because it makes it much easier to get the scaling right: the spectral norm of an :math:`m \times n` random orthogonal matrix is always one, whereas the spectral norm of an :math:`m \times n` random Gaussian matrix depends on the dimensions :math:`m` and :math:`n` and on the entry-wise variance :math:`\sigma^2`, making it harder to set the initialization scale properly. Orthogonal matrices also have the benign property that all of their singular values equal one. In a Gaussian matrix, by contrast, the average singular value and the maximum singular value differ, so Gaussian matrices have more subtle numerical properties.
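
As a quick illustration (plain NumPy, not Modula code), you can compare the two initializations directly:

.. code-block:: python

    # Illustrative sketch: spectral norm of orthogonal vs. Gaussian initialization.
    import numpy as np

    rng = np.random.default_rng(0)

    for n in [64, 256, 1024]:
        q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal matrix
        g = rng.standard_normal((n, n)) / np.sqrt(n)      # Gaussian with variance 1/n

        print(n,
              np.linalg.norm(q, ord=2),   # exactly 1 at every size
              np.linalg.norm(g, ord=2))   # roughly 2, and sensitive to n and the variance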

|question| |nbsp| Does Modula support weight sharing?

Not yet, although we plan to implement this and provide some examples.
10 changes: 8 additions & 2 deletions docs/source/index.rst
@@ -37,6 +37,11 @@ If you like math better than code, then you might prefer to read `our paper <htt
year = 2024
}
+Acknowledgements
+^^^^^^^^^^^^^^^^
+
+Thanks to Gavia Gray, Uzay Girit and Jyo Pari for helpful feedback.

.. toctree::
:hidden:
:maxdepth: 2
@@ -49,7 +54,7 @@ If you like math better than code, then you might prefer to read `our paper <htt
.. toctree::
:hidden:
:maxdepth: 2
-:caption: Theory of modules:
+:caption: Theory of Modules:

theory/vector
theory/module
@@ -60,7 +65,8 @@ If you like math better than code, then you might prefer to read `our paper <htt
.. toctree::
:hidden:
:maxdepth: 2
-:caption: Useful links:
+:caption: More on Modula:

+Modula FAQ <faq>
Modula codebase <https://github.com/jxbz/modula>
Modula paper <https://arxiv.org/abs/2405.14813>
