v0.4.0

@dvorjackz dvorjackz released this 18 Oct 19:00
· 1159 commits to main since this release
6a085ff

We're excited to announce the Beta release of ExecuTorch! This release includes many new features, improvements, and bug fixes.

API Stability and Runtime Compatibility Guarantees

Starting with this release, ExecuTorch's Python and C++ APIs will follow the API Lifecycle and Deprecation Policy, and the .pte file format will comply with the Runtime Compatibility Policy.

New Features

  • Introduced exir.to_edge_transform_and_lower API for combining the functionality of to_edge, transform, and to_backend
    • Allows users to prevent specific op decompositions while lowering to backends that implement those ops
  • Increased operator coverage for ExecuTorch’s portable library
  • Added new experimental APIs:
    • LLM runner C++ APIs such as prefill_image(), prefill_prompt(), and generate_from_pos() with multimodal support
    • executorch.runtime python module for loading .pte files and running them with the underlying C++ runtime
  • Added a new Tensor API to bundle the dynamic data and metadata within a Tensor object
  • Improved the Module API to share an ExecuTorch Program between several Modules and provide APIs to set inputs/outputs before execution
  • Added find_package(executorch) for projects to easily link to ExecuTorch’s prebuilt library in CMake
  • Introduced reproducible benchmarking infrastructure to measure, debug, and track performance, enabling on-demand and automated nightly benchmarking of models and backend delegates on modern smartphones
    • New benchmarking apps to measure model performance on Apple platforms (iOS/macOS) and Android
  • Added support for TikToken v5 vision tokenizer
  • Improved parallelization for LLM prefill
  • Added experimental capabilities for on-device training, along with an example prototype for LLM finetuning
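
As a rough illustration of the combined export API listed above, the sketch below exports an eager PyTorch module, lowers it with to_edge_transform_and_lower, and writes a .pte file. The helper name export_to_pte and the choice of the XNNPACK partitioner are assumptions made for this example, not something the release prescribes; any backend partitioner could be passed instead.

```python
def export_to_pte(model, example_inputs, output_path):
    """Export an eager PyTorch module to a .pte file (hypothetical helper).

    `example_inputs` is a tuple of tensors matching the model's forward().
    """
    # Imported lazily so this module loads even where torch/executorch
    # are not installed; both are required when the function is called.
    import torch
    from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
        XnnpackPartitioner,
    )
    from executorch.exir import to_edge_transform_and_lower

    # 1. Capture the model as an ExportedProgram.
    exported = torch.export.export(model, example_inputs)

    # 2. Convert to Edge dialect and delegate supported subgraphs in one
    #    call. XNNPACK is an assumption here; swap in another partitioner
    #    to target a different backend.
    edge = to_edge_transform_and_lower(
        exported, partitioner=[XnnpackPartitioner()]
    )

    # 3. Emit the ExecuTorch program and serialize it.
    program = edge.to_executorch()
    with open(output_path, "wb") as f:
        f.write(program.buffer)
    return output_path
```

A call such as export_to_pte(my_module.eval(), (torch.randn(1, 3, 224, 224),), "model.pte") would then produce a file that the runtime can load; the module and input shape here are placeholders.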

Supported Models

  • Added support for the following models:
    • LLaMA 3 models, including LLaMA 3 8B, 3.1 8B, and 3.2 1B/3B
    • [MultiModal] LLaVA (Large Language and Vision Assistant)
    • Phi-3-mini
    • Gemma 2B
  • Added LLaMA 3, 3.1, and 3.2 to the Android Llama Demo app
  • Added LLaVA multimodal support to the iOS iLLaMA and Android LLaMA Demo apps

Hardware Acceleration

  • Delegate framework
  • [New] MediaTek
    • Added support for a new MediaTek backend
    • Enabled LLaMA 3 acceleration on MediaTek’s NPU
    • Added export scripts and runners for 8 different OSS models
    • Implemented intermediate tensor logging
  • CoreML
    • Added LLaMA support for in-place KV cache, fused SDPA kernel, and 4-bit per-block quantization
    • Added primitive support for dynamic shapes to work without torch._check
    • Expanded operator coverage to over 100 ops
    • Enabled stateful runtime execution
    • Implemented intermediate tensor logging
  • MPS
    • Added support for 4-bit linear kernels (iOS 18 only)
    • Enabled LLaMA 2 7B and LLaMA 3 8B
  • Qualcomm (Qualcomm Neural Network)
    • Enabled LLaMA 3 8B with 4-bit linear kernel, SpinQuant, fused RMSNorm from QNN 2.25, and model sharding
    • Added support for the AI Hub model format
    • Implemented intermediate tensor logging
  • ARM
    • Added new operators
      • add, addmm, avg_pool2d, batch_norm, bmm, clone/cat, conv2d improvements, div, exp, full, hardtanh, log, mean_dim, mul, permute, relu, sigmoid, slice, softmax, sub, unsqueeze, view
    • Added/enabled lowering passes to improve network compatibility
    • Improved quantization support
      • Made quantization accuracy improvements for all models
      • Added quantization coverage for all available ops
    • Improved channel last support by reducing overhead and number of conversions
    • Added performance measurements on Corstone-300 FVP for Ethos-U55
    • Moved to new compilation flow in Vela to provide better performance and compatibility
    • Improved code documentation for third party contributors
  • XNNPACK
    • Enhanced XNNPACK backend performance
    • Added support for new LLaMA models and other quantized LLMs on Android/iOS devices, including LLaMA 3 8B, 3.1 8B, and 3.2 1B/3B
    • Introduced major partitioner refactor to improve UX and stability
    • Improved model coverage to ensure better stability
  • Vulkan
    • Made latency optimizations for Vulkan convolution and matrix multiplication compute shaders through various algorithmic improvements
    • Added quantizer for 8-bit weight-only quantization
    • Expanded operator coverage to 63 ops
    • Added 4-bit and 8-bit weight quantized linear kernels
    • Added support for view tensors in the Vulkan graph runtime, allowing no-copy permute, squeeze/unsqueeze, and similar operations
    • Added support for symbolic integers in the Vulkan graph runtime
    • Integrated with the ExecuTorch SDK to track compute shader latencies
  • Cadence
    • Added an x86 executor to sanity check and numerically verify models locally
    • Added multiple supported e2e models such as wav2vec2
    • Integrated low-level optimizations resulting in 10x+ performance improvements
    • Migrated more graph-level optimizations to the open source repository
    • Enabled more types in the CadenceQuantizer, and moved to int8 default for better performance

Developer Experience

  • Introduced API to enable intermediate output logging in delegates
  • Improved CMake build system and reduced reliance on Buck2
  • Added override options for fallback PAL implementations through CMake flag (-DEXECUTORCH_PAL_DEFAULT)
  • Changes to DimOrder (please see this issue for current progress and next steps)

Bug Fixes

  • Fixed various issues related to quantization, tensor operations, and backend integrations
  • Resolved memory allocation and management issues
  • Fixed compatibility issues with different Python and dependency versions
  • Fixed bundled program and plan_execute in pybindings

Breaking Changes

  • Updated the minimum C++ version to C++17 for the core runtime
  • Removed all C++ headers under //executorch/util (see extension/runner_util/inputs.h for a PrepareInputTensors replacement)
    • Users are now expected to provide their own read_file.h functionality
  • Renamed instances of sdk to devtools for file names, function names, and CMake options

Deprecation

  • Added new annotations and decorators for API lifecycle and deprecation management
    • New ET_EXPERIMENTAL annotation indicates C++ APIs that may change without notice
    • New @deprecated and @experimental python decorators indicate non-stable APIs
  • Names under the torch:: namespace are deprecated in favor of names under the executorch:: namespace. Please migrate code to the new namespace and avoid adding new references to the torch:: namespace
  • Constant buffers are no longer stored inside the .pte flatbuffer; going forward, they are stored in a segment attached to the .pte file
  • All C++ macros beginning with underscores such as __ET_UNUSED are deprecated in favor of unprefixed names such as ET_UNUSED
  • capture_pre_autograd_graph() is deprecated in favor of the new torch.export.export_for_training() API
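
To make the deprecations above easier to act on, here is a minimal sketch of a source scanner that flags the deprecated spellings. The function names, rule labels, and regular expressions are assumptions made for this example and are not part of ExecuTorch. Only the macro rename is treated as mechanical, and it assumes the double-underscore __ET_ form shown in the example; torch:: references are reported but not rewritten, since the notes do not give a one-to-one namespace mapping.

```python
import re

# Patterns derived from the deprecation notes; the rule names are illustrative.
DEPRECATED_PATTERNS = {
    "underscore-prefixed ET macro": re.compile(r"\b__ET_[A-Z0-9_]+\b"),
    # \b keeps "executorch::" from matching, since "u" precedes "torch".
    "torch:: namespace": re.compile(r"\btorch::"),
    "capture_pre_autograd_graph": re.compile(r"\bcapture_pre_autograd_graph\s*\("),
}


def find_deprecated(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, rule_name, matched_text) for each deprecated usage."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in DEPRECATED_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((lineno, rule, match.group(0)))
    return hits


def rename_et_macros(source: str) -> str:
    """Rewrite __ET_FOO as ET_FOO, leaving everything else untouched."""
    return re.sub(r"\b__(ET_[A-Z0-9_]+)\b", r"\1", source)


if __name__ == "__main__":
    sample = "__ET_UNUSED int x;\nusing torch::executor::Tensor;\n"
    for lineno, rule, text in find_deprecated(sample):
        print(f"line {lineno}: {rule}: {text}")
    print(rename_et_macros(sample), end="")
```

Running the scanner over a source tree before upgrading gives a quick inventory of what needs attention; the namespace hits still need a manual look to pick the right replacement name.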

Thanks to the following open source contributors for their work on this release!

denisVieriu97, Erik-Lundell, Esteb37, SaoirseARM, benkli01, bigfootjon, chuntl, cymbalrush, derekxu, dulinriley, freddan80, haowhsu-quic, namanahuja, neuropilot-captain, oscarandersson8218, per, python3kgae, r-barnes, robell, salykova, shewu-quic, tom-arm, winskuo-quic, zingo

Full Changelog: v0.3.0...v0.4.0