Update docs

facebookresearch · Dec 26, 2024 · be78960 · be78960
commit be78960
Show file tree

Hide file tree

Showing 502 changed files with 136,498 additions and 0 deletions.
diff --git a/.nojekyll b/.nojekyll
diff --git a/index.html b/index.html
@@ -0,0 +1,11 @@
+<!DOCTYPE html>
+<html>
+  <head>
+    <title>SPDL</title>
+  </head>
+  <body>
+    <p>Redirecting to the main document.</p>
+    <script>
+      window.location.href= "./main/index.html";
+    </script>
+</html>
diff --git a/main/_sources/api.rst.txt b/main/_sources/api.rst.txt
@@ -0,0 +1,13 @@
+API Reference
+=============
+
+.. autosummary::
+   :toctree: generated
+   :template: _custom_autosummary_module.rst
+   :caption: API Reference
+   :recursive:
+
+   spdl.io
+   spdl.pipeline
+   spdl.dataloader
+   spdl.utils
diff --git a/main/_sources/best_practice.rst.txt b/main/_sources/best_practice.rst.txt
@@ -0,0 +1,133 @@
+Best Practices
+==============
+
+Avoid creating intermediate tensors
+-----------------------------------
+
+For efficient and performant data processing, it is advised to not create
+an intermediate Tensor for each individual media object (such as single image),
+instead create a batch Tensor directly.
+
+We recommend decoding individual frames, then using :py:func:`spdl.io.convert_frames`
+to create a batch Tensor directly without creating an intermediate Tensors.
+
+If you are decoding a batch of images, and you have pre-determined set of images
+that should go together into a same batch, use
+:py:func:`spdl.io.load_image_batch` (or its async variant
+:py:func:`spdl.io.async_load_image_batch`).
+
+Otherwise, demux, decode and pre-process multiple media, then combine them with
+:py:func:`spdl.io.convert_frames` (or :py:func:`spdl.io.async_convert_frames`).
+For example, the following functions implement decoding and tensor creation
+separately.
+
+.. code-block::
+
+   import spdl.io
+   from spdl.io import ImageFrames
+
+   def decode_image(src: str) -> ImageFrames:
+       packets = spdl.io.async_demux_image(src)
+       return spdl.io.async_decode_packets(packets)
+
+   def batchify(frames: list[ImageFrames]) -> ImageFrames:
+       buffer = spdl.io.convert_frames(frames)
+       return spdl.io.to_torch(buffer)
+
+They can be combined in :py:class:`~spdl.pipeline.Pipeline`, which automatically
+discards the items failed to process (for example due to invalid data), and
+keep the batch size consistent by using other items successfully processed.
+
+.. code-block::
+
+   from spdl.pipeline import PipelineBuilder
+
+   pipeline = (
+       PipelineBuilder()
+       .add_source(...)
+       .pipe(decode_image, concurrency=...)
+       .aggregate(32)
+       .pipe(batchify)
+       .add_sink(3)
+       .build(num_threads=...)
+   )
+
+.. seealso::
+
+   :py:mod:`multi_thread_preprocessing`
+
+Make Dataset class composable
+-----------------------------
+
+If you are publishing a dataset and providing an implementation of
+`Dataset` class, we recommend to make it composable.
+
+That is, in addition to the conventional ``Dataset`` class that
+returns Tensors, make the components of the ``Dataset``
+implementation available by breaking down the implementation into
+
+* Iterator (or map) interface that returns paths instead of Tensors.
+* A helper function that loads the source path into Tensor.
+
+For example, the interface of a ``Dataset`` for image classification
+might look like the following.
+
+.. code-block::
+
+   class Dataset:
+       def __getitem__(self, key: int) -> tuple[Tensor, int]:
+           ...
+
+We recommend to separate the source and process and make them additional
+public interface.
+(Also, as described above, we recommend to not convert each item into
+``Tensor`` for the performance reasons.)
+
+.. code-block::
+
+   class Source:
+       def __getitem__(self, key: int) -> tuple[str, int]:
+           ...
+
+   def load(data: tuple[str, int]) -> tuple[ImageFrames, int]:
+       ...
+
+and if the processing is composed of stages with different bounding
+factor, then split them further into primitive functions.
+
+.. code-block::
+
+   def download(src: tuple[str, int]) -> tuple[bytes, int]:
+       ...
+
+   def decode_and_preprocess(data: tuple[bytes, int]) -> tuple[ImageFrames, int]:
+       ...
+
+then the original ``Dataset`` can be implemented as a composition
+
+.. code-block::
+
+   class Dataset:
+       def __init__(self, ...):
+           self._src = Source(...)
+
+       def __getitem__(self, key:int) -> tuple[str, int]:
+           metadata = self._src[key]
+           item = download(metadata)
+           frames, cls = decode_and_preprocess(item)
+           tensor = spdl.io.to_torch(frames)
+           return tensor, cls
+
+Such decomposition makes the dataset compatible with SPDL's Pipeline,
+and allows users to run them more efficiently
+
+.. code-block::
+
+   pipeline = (
+       PipelineBuilder()
+       .add_source(Source(...))
+       .pipe(download, concurrency=8)
+       .pipe(decode_and_preprocess, concurrency=4)
+       ...
+       .build(...)
+   )
diff --git a/main/_sources/cpp.rst.txt b/main/_sources/cpp.rst.txt
@@ -0,0 +1,9 @@
+API References (C++)
+====================
+
+.. toctree::
+   :caption: API References (C++)
+
+   Class List <generated/libspdl/libspdl_class>
+   File List <generated/libspdl/libspdl_file>
+   API <generated/libspdl/libspdl_api>
diff --git a/main/_sources/examples.rst.txt b/main/_sources/examples.rst.txt
@@ -0,0 +1,13 @@
+Examples
+========
+
+.. autosummary::
+   :toctree: generated
+   :template: _custom_autosummary_example.rst
+   :caption: Examples
+   :recursive:
+
+   image_dataloading
+   video_dataloading
+   imagenet_classification
+   multi_thread_preprocessing
diff --git a/main/_sources/faq.rst.txt b/main/_sources/faq.rst.txt
@@ -0,0 +1,112 @@
+Frequently Asked Questions
+==========================
+
+How to work around GIL?
+-----------------------
+
+In Python, GIL (Global Interpreter Lock) practically prevents the use of multi-threading, however extension modules that are written in low-level languages, such as C, C++ and Rust, can release GIL when executing operations that do not interact with Python interpreter.
+
+Many libraries used for data loading release the GIL. To name a few;
+
+- Pillow
+- OpenCV
+- Decord
+- tiktoken
+
+Typically, the bottleneck of model training in loading and pre-processing the media data.
+So even though there are still parts of pipelines that are constrained by GIL,
+by taking advantage of pre-processing functions that release GIL,
+we can achieve high throughput.
+
+What if a function does not release GIL?
+----------------------------------------
+
+In case you need to use a function that takes long time to execute (e.g. network utilities)
+but it does not release GIL, you can delegate the stage to sub-process.
+
+:py:meth:`spdl.pipeline.PipelineBuilder.pipe` method takes an optional ``executor`` argument.
+The default behavior of the ``Pipeline`` is to use the thread pool shared among all stages.
+You can pass an instance of :py:class:`concurrent.futures.ProcessPoolExecutor`,
+and that stage will execute the function in the sub-process.
+
+.. code-block::
+
+   executor = ProcessPoolExecutor(max_workers=num_processes)
+
+   pipeline = (
+       PipelineBuilder()
+       .add_source(src)
+       .pipe(stage1, executor=executor, concurrency=num_processes)
+       .pipe(stage2, ...)
+       .pipe(stage3, ...)
+       .add_sink(1)
+       .build()
+   )
+
+This will build pipeline like the following.
+
+.. include:: ./plots/faq_subprocess_chart.txt
+
+.. note::
+
+   Along with the function arguments and return values, the function itself is also
+   serialized and passed to the sub-process. Therefore, the function to be executed
+   must be a plain function. Closures and class methods cannot be passed.
+
+.. tip::
+
+   If you need to perform one-time initialization in sub-process, you can use
+   ``initializer`` and ``initargs`` arguments.
+
+   The values passed as ``initializer`` and ``initargs`` must be picklable.
+   If constructing an object in a process that does not support picke, then
+   you can pass constructor arguments instead and store the resulting object
+   in global scope. See also https://stackoverflow.com/a/68783184/3670924.
+
+   Example
+
+   .. code-block::
+
+      def init_resource(*args):
+          global rsc
+          rsc = ResourceClass(*args)
+
+      def process_with_resource(item):
+          global rsc
+
+          return rsc.process(item)
+
+      executor = ProcessPoolExecutor(
+          max_workers=4,
+          mp_context=None,
+          initializer=init_resource,
+          initargs=(...),
+      )
+
+      pipeline = (
+          PipelineBuilder()
+          .add_source()
+          .pipe(
+              process_with_resource,
+              executor=executor,
+              concurrency=4,
+          )
+          .add_sink(3)
+          .build()
+      )
+
+Which functions hold the GIL?
+-----------------------------
+
+The following is the list of functions that we are aware that they hold the GIL. So it is advised to use them with ``ProcessPoolExecutor`` or avoid using them in SPDL.
+
+* `np.load <https://github.com/numpy/numpy/blob/maintenance/2.1.x/numpy/lib/_npyio_impl.py#L312-L500>`_
+
+Why Async IO?
+-------------
+
+When training a model with large amount of data, the data are retrieved from remote locations. Network utilities often provide APIs based on Async I/O.
+
+The Async I/O allows to easily build complex data pre-processing pipeline and execute them while automatically parallelizing parts of the pipeline, achieving high throughput.
+
+Synchronous operations that release GIL can be converted to async operations easily by running them in a thread pool. So by converting the synchronous pre-processing functions that release GIL into asynchronous operations, the entire data pre-processing pipeline can be executed in async event loop. The event loop handles the scheduling of data processing functions, and execute them concurrently.