Merge pull request #25 from jasondavies/typos
Fix typos.
vmilosevic authored Jun 20, 2024
2 parents d2a9882 + 603f581 commit 6813def
Showing 4 changed files with 20 additions and 20 deletions.
20 changes: 10 additions & 10 deletions docs/public/developer.rst
@@ -125,7 +125,7 @@ User Visible Constants
++++++++++++++++++++++

Constant registers are implemented as objects which can be referenced
-whereever a vector can be used.
+wherever a vector can be used.

* Grayskull:

@@ -230,8 +230,8 @@ Library

Below ``Vec`` means any vector type.

-Grayskulll and Wormhole
-^^^^^^^^^^^^^^^^^^^^^^^
+Grayskull and Wormhole
+^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: c++

@@ -396,8 +396,8 @@ For example:
l_reg[LRegs::LReg1] = x; // this is necessary at the end of the function
// to preserve the value in LReg1 (if desired)

-Miscelaneous
-************
+Miscellaneous
+*************

Register Pressure Management
++++++++++++++++++++++++++++
@@ -413,7 +413,7 @@ loads dst_reg[0] and dst_reg[1] into temporary LREGs (as expected).

The compiler will not spill registers. Exceeding the number of registers
available will result in the cryptic: ``error: cannot store SFPU register
-(reigster spill?) - exiting!`` without a line number.
+(register spill?) - exiting!`` without a line number.

The compiler does a reasonable job with lifetime analysis when assigning
variables to registers. Reloading or recalculating results helps the compiler
@@ -448,7 +448,7 @@ The ``SFPREPLAY`` instruction available on Wormhole allows the RISCV processor
to submit up to 32 SFP instructions at once. The compiler looks for sequences
of instructions that repeat, stores these and then "replays" them later.

-The current implemention of this is very much first cut: it does not handle
+The current implementation of this is very much first cut: it does not handle
kernels with rolled up loops very well. Best performance is typically attained by
unrolling the top level loop and then letting the compiler find the repetitions
and replace them with ``SFPREPLAY``. This works well when the main loop
@@ -494,15 +494,15 @@ Register Spilling
+++++++++++++++++

The compiler does not implement register spilling. Since Grayskull only has 4
-LRegs, running out of registers is a common occurence. If you see the
-following: ``error: cannot store SFPU register (reigster spill?) - exiting!``
+LRegs, running out of registers is a common occurrence. If you see the
+following: ``error: cannot store SFPU register (register spill?) - exiting!``
you have most likely run out of registers.

Error Messages
++++++++++++++

Unfortunately, many errors are attributed to the code in the wrapper rather than in the code
-being written. For example, using an unitialized variable would show an error at a macro
+being written. For example, using an uninitialized variable would show an error at a macro
called by a wrapper function before showing the line number in the user's code.

Function Calls
2 changes: 1 addition & 1 deletion docs/public/installation.rst
@@ -38,7 +38,7 @@ Python Environment Installation

It is strongly recommended to use virtual environments for each project utilizing PyBUDA and Python dependencies. Creating a new virtual environment with PyBUDA and libraries is very easy.

-Prerequisites (detailed sections below) for python envirnment installation are listed here:
+Prerequisites (detailed sections below) for python environment installation are listed here:

* `Setup HugePages (below) <#setup-hugepages>`_
* `PCI Driver Installation (below) <#pci-driver-installation>`_
4 changes: 2 additions & 2 deletions docs/public/terminology.rst
@@ -27,7 +27,7 @@ The dense tensor math unit in Tensix. It performs bulk tensor math operations, s

SFPU
----
-Tensix SIMD engine, used for various miscellaneous activations operations, such as exponents, square roots, softmax, topK, and others.
+Tensix SIMD engine, used for various miscellaneous activation operations, such as exponents, square roots, softmax, topK, and others.

Unpacker
--------
@@ -49,7 +49,7 @@ A collection of ops that fits onto one chip. In a typical workflow, epoch code w

Buffer
------
-A reserved location in local memory, DRAM, or host memory. Buffers are used either as desinations for operation outputs, sources for operation inputs, or temporary locations for intermediate data.
+A reserved location in local memory, DRAM, or host memory. Buffers are used either as destinations for operation outputs, sources for operation inputs, or temporary locations for intermediate data.

Pipe
----
14 changes: 7 additions & 7 deletions docs/public/user_guide.rst
@@ -90,7 +90,7 @@ PyBuda API and workflow is flexible enough that some of these steps can be merge
Devices
*******

-PyBuda makes it easy to distribute a workload onto a heterogenous set of devices available to you. This can be one or more
+PyBuda makes it easy to distribute a workload onto a heterogeneous set of devices available to you. This can be one or more
Tenstorrent devices, CPUs, or GPUs. Each device that will be used to run your workflow needs to be declared by creating the appropriate
device type and giving it a unique name:
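
A minimal sketch of such declarations, assuming only the ``TTDevice`` and ``CPUDevice`` constructors referenced in this guide; the device names are arbitrary:

.. code-block:: python

    import pybuda

    # Each device participating in the workflow gets a unique name.
    tt0 = pybuda.TTDevice("tt0")     # a Tenstorrent device
    cpu0 = pybuda.CPUDevice("cpu0")  # a host CPU device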

@@ -121,7 +121,7 @@ To run a module on a device, it needs to be "placed" on it
tt0.place_module(mod)
This tells PyBuda that module ``mod`` needs to be compiled and executed on device ``tt0``. In this case, ``mod`` is a native PyBuda module. To
-simiarly place a PyTorch module onto a Tenstorrent device, the module must be wrapped in a :py:class:`PyTorchModule<pybuda.PyTorchModule>` wrapper:
+similarly place a PyTorch module onto a Tenstorrent device, the module must be wrapped in a :py:class:`PyTorchModule<pybuda.PyTorchModule>` wrapper:

.. code-block:: python
@@ -147,7 +147,7 @@ PyBuda provides all-in-one APIs for compiling and running workloads, :py:func:`r
For inference, and simple training setups, this is the simplest way to get up and running.

Alternatively, the models can be compiled in a separate step, using the :py:func:`initialize_pipeline<pybuda.initialize_pipeline>` call,
-which optioanlly takes sample inputs, if none have been pushed into the first device. Once the compilation has completed, the user
+which optionally takes sample inputs, if none have been pushed into the first device. Once the compilation has completed, the user
can run :py:func:`run_forward<pybuda.run_forward>` pass through the pipeline for inference, or a loop of
:py:func:`run_forward<pybuda.run_forward>`, :py:func:`run_backward<pybuda.run_backward>`, and :py:func:`run_optimizer<pybuda.run_optimizer>`
calls to manually implement a training loop:
@@ -165,10 +165,10 @@ calls to manually implement a training loop:
CPU Fallback
************

-If there are operators in the workload that are unsuppored by PyBuda, the user can create a CPUDevice and place module containing those
+If there are operators in the workload that are unsupported by PyBuda, the user can create a CPUDevice and place module containing those
operators onto that CPUDevice. If enabled, PyBuda is capable of doing this automatically.

-If a TTDevice contains unsuppored operators, during compilation, the device will be split into mupltiple devices (TTDevice and CPUDevice). If
+If a TTDevice contains unsupported operators, during compilation, the device will be split into multiple devices (TTDevice and CPUDevice). If
the CPUDevice is at the front of the pipeline (i.e. the unsupported ops are in the first half of the graph), any inputs pushed to the TTDevice
will be redirected to the correct CPUDevice.
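
A minimal sketch of the manual split described above, assuming pybuda's ``CPUDevice``/``TTDevice`` constructors, ``place_module``, ``push_to_inputs``, and ``run_inference``; the two PyTorch modules are hypothetical placeholders:

.. code-block:: python

    import torch
    import pybuda

    class FrontEnd(torch.nn.Module):
        """Hypothetical stage whose ops are assumed to be unsupported on device."""
        def forward(self, x):
            return torch.sort(x, dim=-1)[0]

    class Body(torch.nn.Module):
        """Hypothetical stage whose ops run on the Tenstorrent device."""
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(128, 128)

        def forward(self, x):
            return self.linear(x)

    # The unsupported front of the pipeline goes on a CPUDevice, the rest on a
    # TTDevice; inputs pushed to cpu0 flow through the pipeline to tt0.
    cpu0 = pybuda.CPUDevice("cpu0")
    cpu0.place_module(pybuda.PyTorchModule("front_end", FrontEnd()))

    tt0 = pybuda.TTDevice("tt0")
    tt0.place_module(pybuda.PyTorchModule("body", Body()))

    cpu0.push_to_inputs(torch.rand(1, 128))
    output_q = pybuda.run_inference()
    print(output_q.get())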

@@ -647,7 +647,7 @@ Using Multiple Tenstorrent Devices

PyBuda makes it easy to parallelize workloads onto multiple devices. A single :py:class:`TTDevice<pybuda.TTDevice>` can be used as a wrapper to any number of available
Tenstorrent devices accessible to the host - either locally or through ethernet. The PyBuda compiler will then break up the workload over
-assigned devices using either pipeline or model parllelism strategies, or a combination of both.
+assigned devices using either pipeline or model parallelism strategies, or a combination of both.

The easiest way to use all available hardware is to set ``num_chips`` parameter in :py:class:`TTDevice<pybuda.TTDevice>` to 0, which instructs it to auto-detect and use everything it can find.
However, ``num_chips`` and ``chip_ids`` parameters can be used to select a subset of available hardware:
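
A minimal sketch of both selection modes, using only the ``num_chips`` and ``chip_ids`` parameters named above; the device names and chip indices are arbitrary:

.. code-block:: python

    import pybuda

    # Auto-detect and use every Tenstorrent chip reachable from the host.
    tt_all = pybuda.TTDevice("tt_all", num_chips=0)

    # Restrict a device to an explicit subset of the available chips.
    tt_pair = pybuda.TTDevice("tt_pair", chip_ids=[0, 1])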
@@ -776,7 +776,7 @@ The following Python code generates a Multi-Model TTI in a manner identical to t

During the model fusion process, the API presented above is responsible for performing memory reallocation. Users may be interested in the memory footprint of the fused model (both Device and Host DRAM).

-To fullfil this requirement, the tool reports memory utilization post reallocation. An example using a model compiled for Wormhole (with 6 Device and upto 4 Host DRAM channels) is provided below.
+To fulfill this requirement, the tool reports memory utilization post reallocation. An example using a model compiled for Wormhole (with 6 Device and up to 4 Host DRAM channels) is provided below.

.. code-block:: bash
