Releases · intel/neural-compressor
Intel Neural Compressor Release 3.2
- Highlights
- Features
- Improvements
- Bug Fixes
- Validated Hardware
- Validated Configurations
Highlights
- Aligned with the Habana 1.19 release, with improvements to FP8 and INT4 quantization for the Intel® Gaudi® AI accelerator
- INT4 weight-only quantization on Intel® Arc™ B-Series Graphics GPU (code-named BattleMage)
Features
- Saving and loading FP8 checkpoints on Gaudi (see the sketch after this list)
- Loading vLLM/llm-compressor-compatible FP8 checkpoints on Gaudi
- Arbitrary scale method support on Gaudi
- AutoRound INT4 weight-only quantization on Gaudi
- Block-wise calibration for LLM on Gaudi
- INT4 weight-only quantization on BattleMage
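For orientation, a minimal sketch of the FP8 flow on Gaudi with the PyTorch framework-extension API. `FP8Config`, `prepare`, and `convert` follow the 3.x API; the toy model, calibration loop, and the commented save/load call shapes are illustrative assumptions, and running this requires a Gaudi device with the Habana software stack:

```python
import torch
from neural_compressor.torch.quantization import FP8Config, prepare, convert

# Toy stand-in; a real flow would move an LLM to the Gaudi ("hpu") device.
model = torch.nn.Linear(16, 16).eval().to("hpu")

# Measurement step: attach observers, then run calibration data through.
config = FP8Config(fp8_config="E4M3")
model = prepare(model, config)
for _ in range(4):
    model(torch.randn(1, 16).to("hpu"))

# Quantization step: swap modules to FP8 using the collected scales.
model = convert(model)

# 3.2 adds FP8 checkpoint saving/loading; the exact call shapes are assumed:
# model.save("fp8_ckpt")                  # hypothetical save call
# model = load("fp8_ckpt", device="hpu")  # hypothetical load call
```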
Improvements
- Improve FP8 performance on Gaudi by setting the scale as a scalar tensor
- Integrate AutoRound 0.4.2 with VLM quantization improvements
- Improve safetensors loading for layer-wise quantization in Transformers-like API
- Improve non-contiguous weight saving in Transformers-like API
Bug Fixes
- Fix layer-wise quantization issue in GPTQ on client GPU
- Fix glm-4-9b model out-of-memory issue on BattleMage
Validated Hardware
- Intel Gaudi AI Accelerators (Gaudi 2 and 3)
- Intel Xeon Scalable Processors (4th, 5th, and 6th Gen)
- Intel Core Ultra Processors (Series 1 and 2)
- Intel Data Center GPU Max Series (1100)
- Intel Arc B-Series Graphics GPU (B580)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 11
- Python 3.9, 3.10, 3.11, 3.12
- PyTorch/IPEX 2.3, 2.4, 2.5
Intel Neural Compressor Release 3.1
- Highlights
- Features
- Improvements
- Validated Hardware
- Validated Configurations
Highlights
- Aligned with the Habana 1.18 release, with improvements to FP8 and INT4 quantization for the Intel® Gaudi® AI accelerator
- Provided a Transformers-like quantization API for weight-only quantization of LLMs, offering Transformers users a one-stop experience for quantization and inference with IPEX on Intel GPUs and CPUs
Features
- Add Transformers-like quantization API for weight-only quantization on LLMs (see the sketch after this list)
- Support fast quantization with a lightweight recipe and a layer-wise approach on Intel AI PCs
- Support INT4 quantization of Visual Language Models (VLMs), such as LLaVA, Phi-3-vision, and Qwen-VL, with the AutoRound algorithm
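A minimal sketch of the Transformers-like flow described above; the `neural_compressor.transformers` module and `RtnConfig` name follow this release's API, though the exact constructor arguments shown are assumptions:

```python
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

model_name = "facebook/opt-125m"  # any HF causal LM; chosen here for illustration

# INT4 weight-only quantization config (argument names assumed).
quantization_config = RtnConfig(bits=4, group_size=128)

# from_pretrained quantizes the model on load, mirroring the HF workflow.
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("The capital of France is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```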
Improvements
- Support loading and converting AWQ-format INT4 models for IPEX inference in the Transformers-like API
- Enable AutoRound format export for INT4 models
- Support per-channel INT8 Post Training Quantization for PT2E (see the sketch after this list)
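The PT2E items above build on PyTorch's export-based quantization flow. Below is a minimal sketch using the upstream torch.ao PT2E API rather than INC's wrapper, with the caveat that the graph-capture entry point differs across the PyTorch versions listed under Validated Configurations:

```python
import torch
from torch._export import capture_pre_autograd_graph  # capture entry point in 2.3/2.4
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)

# Capture the model as an ATen graph (newer PyTorch releases use
# torch.export.export_for_training(...).module() instead).
exported = capture_pre_autograd_graph(model, example_inputs)

# The default x86 inductor config uses per-channel symmetric INT8 weights.
quantizer = X86InductorQuantizer()
quantizer.set_global(get_default_x86_inductor_quantization_config())

prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)           # calibration pass
quantized = convert_pt2e(prepared)
```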
Validated Hardware
- Intel Gaudi AI Accelerators (Gaudi 2 and 3)
- Intel Xeon Scalable Processors (4th, 5th, and 6th Gen)
- Intel Core Ultra Processors (Series 1 and 2)
- Intel Data Center GPU Max Series (1100)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 11
- Python 3.9, 3.10, 3.11, 3.12
- PyTorch/IPEX 2.2, 2.3, 2.4
Intel® Neural Compressor v3.0 Release
- Highlights
- Features
- Improvements
- Examples
- Bug Fixes
- Documentations
- Validated Configurations
Highlights
- FP8 quantization and INT4 model loading support on Intel® Gaudi® AI accelerator
- Framework extension API for quantization, mixed-precision and benchmarking
- Accuracy-aware FP16 mixed precision support on Intel® Xeon® 6 Processors
- Performance optimizations and usability improvements on client-side quantization
Features
- [Quantization] Support FP8 quantization on Gaudi (95197d)
- [Quantization] Support INC and Hugging Face model loading on framework extension API for PyTorch (0eced1, bacc16)
- [Quantization] Support Weight-only Quantization on framework extension API for PyTorch (34f0a9, de43d8, 4a4509, 1a4509, a3a065, 1386ac, a0dee9, 503d9e, 84d705, 099b7a, e3c736, e87c95, 2694bb, ec49a2, e7b4b6, a9bf79, ac717b, 915018, 8447d7, dc9328)
- [Quantization] Support static and dynamic quantization in PT2E path (7a4715, 43c358, 30b36b, 1f58f0, 02958d)
- [Quantization] Support SmoothQuant and static quantization in IPEX path with framework extension API (53e6ee, 72fbce, eaa3a5, 95e67e, 855c10, 9c6102, 5dafe5, a5e5f5, 191383, 776645)
- [Quantization] Support Layer-wise Quantization for RTN/GPTQ on framework extension API for PyTorch (649e6b)
- [Quantization] Support Post Training Quantization on framework extension API for Tensorflow (6c27c1, e22c61, f21afb, 3882e9, 2627d3)
- [Quantization] Support Post Training Quantization on Keras3 (f67e86, 047560)
- [Quantization] Support Weight-only Quantization on Gaudi2 (4b9b44, 14868c, 0a3d4b)
- [Quantization] Improve performance and usability of quantization procedure on client side (16a7b1)
- [Quantization] Support auto-device detection on framework extension API for PyTorch (368ba5, 4b9b44, e81a2d, 0a3d4b, 534300, 2a86ae)
- [Quantization] Support Microscaling (MX) Quant for PyTorch (4a24a6, 455f1e)
- [Quantization] Enable cross-device Half-Quadratic Quantization (HQQ) support for LLMs (db6164, 07f940)
- [Quantization] Support FP8 cast Weight-only Quantization (57ed61)
- [Mixed-Precision] Support FP16 mixed-precision on framework extension autotune API for PyTorch (2e1cdc)
- [Mixed-Precision] Support mixed INT8 with FP16 in PT2E path (fa961e)
- [AutoTune] Support accuracy-aware tuning on framework extension API (see the sketch after this list) (e97659, 7b8aec, 5a0374, a4675c, 3a254e, ac47d9, b8d98e, fb6142, fa8e66, d22df5, 09eb5d, c6a8fa)
- [Benchmarking] Implement `incbench` command for ease-of-use benchmarking (2fc725)
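A minimal sketch of the accuracy-aware tuning flow on the framework extension API; `autotune`, `TuningConfig`, and `RTNConfig` follow the 3.0 API, while the toy model and the evaluation function are placeholders:

```python
import torch
from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune

model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU()).eval()

def eval_fn(q_model) -> float:
    # Placeholder: return a real accuracy metric here; autotune compares each
    # quantized candidate's score against the FP32 baseline.
    return 1.0

# Search over RTN hyperparameters; list-valued fields expand the tuning space.
tune_config = TuningConfig(
    config_set=[RTNConfig(use_sym=[False, True], group_size=[32, 128])]
)
best_model = autotune(model=model, tune_config=tune_config, eval_fn=eval_fn)
```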
Improvements
- [Quantization] Integrate AutoRound v0.3 (bfa27e, fd9…)
Intel® Neural Compressor v2.6 Release
- Highlights
- Features
- Improvements
- Examples
- Bug Fixes
- External Contributions
- Validated Configurations
Highlights
- Integrated recent AutoRound with lm-head quantization support and calibration process optimizations
- Migrated ONNX model quantization capability into ONNX project Neural Compressor
Features
- [Quantization] Integrate recent AutoRound with lm-head quantization support and calibration process optimizations (4728fd)
- [Quantization] Support true sequential options in GPTQ (92c942)
Improvements
- [Quantization] Improve WOQ Linear pack/unpack speed with a NumPy implementation (daa143)
- [Quantization] Auto detect available device when exporting (7be355)
- [Quantization] Refine AutoRound export to support Intel GPU (409231)
- [Benchmarking] Detect the number of sockets when needed (e54b93)
Examples
- Upgrade lm_eval to 0.4.2 in PT and ORT LLM examples (fdb509) (54f039)
- Add diffusers/dreambooth example with IPEX (ba4798)
Bug Fixes
- Fix incorrect dtype of unpacked tensor issue in PT (29fdec)
- Fix TF LLM SQ legacy Keras environment variable issue (276449)
- Fix TF estimator issue by adding version check on TF2.16 (855b98)
- Fix missing tokenizer issue in run_clm_no_trainer.py after using lm-eval 0.4.2 (d64029)
- Fix AWQ padding issue in ORT (903da4)
- Fix recover function issue in ORT (ee24db)
- Update model ckpt download URL in prepare_model.py (0ba573)
- Fix case where pad_max_length is set to None (960bd2)
- Fix a failure for GPU backend (71a9f3)
- Fix numpy versions for rnnt and 3d-unet examples (12b8f4)
- Fix CVEs (5b5579) (25c71a) (47d73b) (41da74)
External Contributions
- Update model ckpt download URL in prepare_model.py (0ba573)
- Fix case where pad_max_length is set to None (960bd2)
- Add diffusers/dreambooth example with IPEX (ba4798)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 11 & macOS Ventura 13.5
- Python 3.8, 3.9, 3.10, 3.11
- PyTorch/IPEX 2.1, 2.2, 2.3
- TensorFlow 2.14, 2.15, 2.16
- ITEX 2.13.0, 2.14.0, 2.15.0
- ONNX Runtime 1.16, 1.17, 1.18
Intel® Neural Compressor v2.5.1 Release
- Improvement
- Bug Fixes
- Validated Configurations
Improvement
- Improve WOQ AutoRound export (409231, 7ee721)
- Adapt ITREX v1.4 release for example evaluation (9d7a05)
- Update more supported LLM recipes (ce9b16)
Bug Fixes
- Fix WOQ RTN supported layer checking condition (079177)
- Fix in-place processing error in quant_weight function (92533a)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.10
- TensorFlow 2.15
- ITEX 2.14.0
- PyTorch/IPEX 2.2
- ONNX Runtime 1.17
Intel® Neural Compressor v2.5 Release
- Highlights
- Features
- Improvement
- Productivity
- Bug Fixes
- External Contributions
- Validated Configurations
Highlights
- Integrated the Weight-Only Quantization algorithm AutoRound and verified it on Gaudi2, Intel CPU, and NVIDIA GPU
- Applied SmoothQuant & Weight-Only Quantization algorithms to 15+ popular LLMs for INT8 & INT4 quantization and published the recipes
Features
- [Quantization] Integrate Weight-Only Quantization algorithm AutoRound (see the sketch after this list) (5c7f33, dfd083, 9a7ddd, cf1de7)
- [Quantization] Quantize weight with in-place mode in Weight-Only Quantization (deb1ed)
- [Pruning] Enable SNIP on multiple cards using DeepSpeed ZeRO-3 (49ab28)
- [Pruning] Support new pruning approach Wanda and DSNOT for PyTorch LLM (7a3671)
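For context, a hedged sketch of driving AutoRound through the 2.x weight-only configuration schema; `PostTrainingQuantConfig` and `quantization.fit` are the documented 2.x entry points, but the exact `op_type_dict` keys, the `"AUTOROUND"` algorithm literal, and the toy model/dataloader are assumptions for illustration:

```python
import torch
from neural_compressor import PostTrainingQuantConfig, quantization

model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval()  # toy stand-in for an LLM
calib_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(8, 64)), batch_size=1
)

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # match all supported op types
            "weight": {
                "bits": 4,
                "group_size": 128,
                "scheme": "asym",
                "algorithm": "AUTOROUND",  # assumed literal selecting AutoRound
            },
        },
    },
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```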
Improvement
- [Quantization] SmoothQuant code structure refactor (a8d81c)
- [Quantization] Optimize the workflow of parsing Keras models (b816d7)
- [Quantization] Support the static_groups option in the GPTQ API (1c426a)
- [Quantization] Update TEQ train dataloader (d1e994)
- [Quantization] Keep self.weight in WeightOnlyLinear after recover (2835bd)
- [Quantization] Add version condition for IPEX prepare init (d96e14)
- [Quantization] Enhance the ORT node name checking (f1597a)
- [Pruning] Stop the tuning process early when enabling smooth quant (844a03)
Productivity
- ORT LLM examples support latest optimum version (26b260)
- Add coding style docs and recommended VS Code setting (c1f23c)
- Adapt transformers 4.37 loading (6133f4)
- Upgrade pre-commit checker for black/blacken-docs/ruff (7763ed)
- Support CI summary in PR comments (d4bcdd)
- Update notebook example to install the latest INC & TF and add a metric in fit (4239d3)
Bug Fixes
- Fix QA IPEX example fp32 input issue (c4de19)
- Update conditions for getting min-max during TF MatMul requantize (d07175)
- Fix TF saved_model issues (d8e60b)
- Fix comparison of module_type and MulLinear (ba3aba)
- Fix ORT calibration issue (cd6d24)
- Fix ORT example bart export failure (b0dc0d)
- Fix TF example accuracy diff during benchmark and quantization (5943ea)
- Fix bugs for GPTQ exporting with static_groups (b4e37b)
- Fix ORT quant issue caused by tensors having same name (0a20f3)
- Fix Neural Solution SQL/CMD injection (14b7b0)
- Fix the best qmodel recovery issue (f2d9b7)
- Fix logger issue (83bc77)
- Store token in protected file (c6f9cc)
- Define the default SSL context (b08725)
- Fix IPEX stats bug (5af383)
- Fix ORT calibration for Dml EP (c58aea)
- Fix wrong socket number retrieval for non-english system (5b2a88)
- Fix trust remote for llm examples (2f2c9a)
External Contributions
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 11 & macOS Ventura 13.5
- Python 3.8, 3.9, 3.10, 3.11
- TensorFlow 2.13, 2.14, 2.15
- ITEX 2.13.0, 2.14.0
- PyTorch/IPEX 2.0, 2.1, 2.2
- ONNX Runtime 1.15, 1.16, 1.17
Intel® Neural Compressor v2.4.1 Release
- Improvement
- Bug Fixes
- Examples
- Validated Configurations
Improvement
- Narrow down the tuning space of SmoothQuant auto-tune (9600e1)
- Support ONNXRT Weight-Only Quantization with different dtypes (5119fc)
- Add progress bar for ONNXRT Weight-Only Quantization and SmoothQuant (4d26e3)
Bug Fixes
- Fix SmoothQuant alpha-space generation (33ece9)
- Fix inputs error for SmoothQuant example_inputs (39f63a)
- Fix LLMs accuracy regression with IPEX 2.1.100 (3cb6d3)
- Fix quantizable add ops detection on IPEX backend (4c004d)
- Fix range step bug in ORTSmoothQuant (40275c)
- Fix unit test bugs and update CI versions (6c78df, 835805)
- Fix notebook issues (08221e)
Examples
- Add verified LLMs list and recipes for SmoothQuant and Weight-Only Quantization (f19cc9)
- Add code-generation evaluation for Weight-Only Quantization GPTQ (763440)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.10
- TensorFlow 2.14
- ITEX 2.14.0.1
- PyTorch/IPEX 2.1.0
- ONNX Runtime 1.16.3
Intel® Neural Compressor v2.4 Release
- Highlights
- Features
- Improvement
- Productivity
- Bug Fixes
- Examples
- Validated Configurations
Highlights
- Supported layer-wise quantization for PyTorch RTN/GPTQ Weight-Only Quantization and ONNX Runtime W8A8 quantization
- Supported Weight-Only Quantization tuning for ONNX Runtime backend
- Supported GGML double quant on RTN/GPTQ Weight-Only Quantization with FW extension API
- Supported SmoothQuant of Big Saved Model for TensorFlow backend
Features
- [Quantization] Support GGML double quant in Weight-Only Quantization for RTN and GPTQ (05c15a)
- [Quantization] Support Weight-Only Quantization tuning for ONNX Runtime backend (6d4ea5, 934ba0, 4fcfdf)
- [Quantization] Support SmoothQuant block-wise alpha-tuning (see the sketch after this list) (ee6bc2)
- [Quantization] Support SmoothQuant of Big Saved Model for TensorFlow Backend (3b2925, 4f2c35)
- [Quantization] Support PyTorch layer-wise quantization for GPTQ (ee5450)
- [Quantization] Support PyTorch layer-wise quantization for RTN (ebd1e2)
- [Quantization] Support ONNX Runtime layer-wise W8A8 quantization (6142e4, 5d33a5)
- [Common] [Experimental] FW extension API implementation (76b8b3, 8447d7, 258236)
- [Quantization] [Experimental] FW extension API for PT backend support Weight-Only Quantization (915018, dc9328)
- [Quantization] [Experimental] FW extension API for TF backend support Keras Quantization (2627d3)
- [Quantization] IPEX 2.1 XPU (CPU+GPU) support (af0b50, cf847c)
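A minimal sketch of enabling SmoothQuant through the 2.x `recipes` argument, following the project's documented 2.x usage; the alpha value (or `"auto"` for alpha tuning) and the toy model/dataloader are illustrative:

```python
import torch
from neural_compressor import PostTrainingQuantConfig, quantization

model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval()  # toy stand-in
calib_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(8, 64), torch.zeros(8)), batch_size=1
)

conf = PostTrainingQuantConfig(
    recipes={
        "smooth_quant": True,
        "smooth_quant_args": {"alpha": 0.5},  # "auto" enables alpha tuning
    }
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```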
Improvement
- [Quantization] Add use_optimum_format for export_compressed_model in Weight-Only Quantization (5179da, 0a0644)
- [Quantization] Enhance ONNX Runtime quantization with DirectML EP (db0fef, d13183, 098401, 6cad50)
- [Quantization] Support restoring IPEX model from JSON (c3214c)
- [Quantization] Add attribute to MatMulNBits in ONNX Runtime (7057e3)
- [Quantization] Increase SmoothQuant auto alpha running speed (173c18)
- [Quantization] Add SmoothQuant alpha search space as a config argument (f9663d)
- [Quantization] Add SmoothQuant weight_clipping as a default_on option (1f4aec)
- [Quantization] Support SmoothQuant with MinMaxObserver (45b496)
- [Quantization] Support Weight-Only Quantization with fp16 for PyTorch backend (d5cb56)
- [Quantization] Support trace with dictionary type example_inputs (afe315)
- [Quantization] Support Falcon Weight-Only Quantization (595d3a)
- [Common] Add deprecation decorator in experimental fold (aeb3ed)
- [Common] Remove 1.x API dependency (ee617a)
- [Mixed Precision] Support PyTorch eager mode BF16 MixedPrecision (3bfb76)
Productivity
- Support quantization and benchmark on macOS (16d6a0)
- Support ONNX Runtime 1.16.0 (d81732, 299af9, 753783)
- Support TensorFlow new API for gnr-base (8160c7)
Bug Fixes
- Fix "GraphModule object has no attribute bias" error (7f53d1)
- Fix ONNX model export issue (af0aea, eaa57f)
- Add clip for ONNX Runtime SmoothQuant (cbb69b)
- Fix SmoothQuant minmax observer init (b1db1c)
- Fix SmoothQuant issue in get/set_module (dffcfe)
- Align sparsity with block-wise masks in progressive pruning (fcdc29)
Examples
- Support peft model with SmoothQuant (5e21b7)
- Enable two ONNX Runtime examples table-transformer-detection (550cee), BEiT (7265df)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Win 10 & macOS Ventura 13.5
- Python 3.8, 3.9, 3.10, 3.11
- TensorFlow 2.13, 2.14, 2.15
- ITEX 1.2.0, 2.13.0.0, 2.14.0.1
- PyTorch/IPEX 1.13.0+cpu, 2.0.1+cpu, 2.1.0
- ONNX Runtime 1.14.1, 1.15.1, 1.16.3
- MXNet 1.9.1
Intel® Neural Compressor v2.3.2 Release
- Features
- Bug Fixes
- Validated Configurations
Features
- Reduce memory consumption in ONNXRT adaptor (f64833)
- Support MatMulFpQ4 for onnxruntime 1.16 (1beb43)
- Support MatMulNBits for onnxruntime 1.17 (67a31b)
Bug Fixes
- Update ITREX version in ONNXRT WOQ example and fix bugs in HF models (0ca51a)
- Update ONNXRT WOQ example to llama-2-7b (7f2063)
- Fix ONNXRT WOQ failure with None model_path (cbd0a4)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.10
- TensorFlow 2.13
- ITEX 2.13
- PyTorch/IPEX 2.0.1+cpu
- ONNX Runtime 1.15.1
- MXNet 1.9.1
Intel® Neural Compressor v2.3.1 Release
- Bug Fixes
- Productivity
- Validated Configurations
Bug Fixes
- Fix PyTorch SmoothQuant for auto alpha (e9c14a, 35def7)
- Fix PyTorch SmoothQuant calibration memory overhead (49e950)
- Fix PyTorch SmoothQuant issue in get/set_module (Issue #1265) (6de9ce)
- Support Falcon Weight-Only Quantization (bf7b5c)
- Remove Conv2d in Weight-Only Quantization adaptor white list (1a6526)
- Fix TensorFlow ssd_resnet50_v1 example for TF new API (c63fc5)
Productivity
- Adapt example for TensorFlow 2.14 AutoTrackable API change (424cf3)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.10
- TensorFlow 2.13, 2.14
- ITEX 2.13
- PyTorch/IPEX 2.0.1+cpu
- ONNX Runtime 1.15.1
- MXNet 1.9.1