Releases: saadrahim/rocBLAS

rocBLAS-2.38.0 for ROCm 4.2.0

10 May 23:13
1d39833

Added

  • Added an option to the install script to build only the rocBLAS clients against a pre-built rocBLAS library
  • Added support in gemm ext for the unpacked int8 input layout on gfx908 GPUs
    • Added a new flag, rocblas_gemm_flags::rocblas_gemm_flags_pack_int8x4, to specify that the packed layout is used
      • Set rocblas_gemm_flags_pack_int8x4 when using packed int8x4. This flag should always be set on GPUs before gfx908.
      • On gfx908 GPUs, unpacked int8 is supported, so the flag does not need to be set.
      • Note that the default flags value 0 selects unpacked int8, which changes the behaviour of int8 gemm relative to ROCm 4.1.0.
  • Added a query function, rocblas_query_int8_layout_flag, to get the preferred int8 layout for gemm on a given device
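
The flag rules above can be sketched as a small selection helper. This is a minimal sketch, not rocBLAS code: the enum is a local stand-in that only mirrors the flag names in these notes (the value 0 for the default comes from the notes, the value 1 for the pack flag is an assumption), and select_int8_flags with its supports_unpacked_int8 parameter is hypothetical. A real program would take the flag from the rocBLAS headers and could let rocblas_query_int8_layout_flag make the decision.

```cpp
#include <cstdint>

// Local stand-in for the rocBLAS flag type, for illustration only.
// The default value 0 is from the release notes; the value 1 is an assumption.
enum gemm_flags_sketch : std::uint32_t
{
    gemm_flags_none_sketch        = 0x0, // default: unpacked int8
    gemm_flags_pack_int8x4_sketch = 0x1  // packed int8x4 layout
};

// Hypothetical helper: choose the flags for an int8 gemm call.
// supports_unpacked_int8 would be true on gfx908 and false on earlier GPUs.
inline gemm_flags_sketch select_int8_flags(bool supports_unpacked_int8)
{
    return supports_unpacked_int8 ? gemm_flags_none_sketch
                                  : gemm_flags_pack_int8x4_sketch;
}
```

Code written for ROCm 4.1.0 that relied on packed int8x4 with flags 0 would, under this rule, need to pass the pack flag explicitly.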

Optimizations

  • Improved performance of single precision copy, swap, and scal when incx == 1 and incy == 1.
  • Improved performance of single precision axpy when incx == 1, incy == 1 and batch_count <= 8192.
  • Improved performance of trmm.

Changed

  • Changed cmake_minimum_required to VERSION 3.16.8

rocBLAS-2.36.0 for ROCm 4.1.0

23 Mar 01:06
93c8293

Added

  • Added a numerical checking helper function to detect zero/NaN/Inf in the input and output vectors of rocBLAS level 1 and 2 functions.
  • Added a numerical checking helper function to detect zero/NaN/Inf in the input and output general matrices of rocBLAS level 2 and 3 functions.
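
As an illustration of what such a check involves, here is a minimal host-side sketch for a strided vector. It is not the rocBLAS helper, which these notes do not show: the names numerical_check_result and check_vector are hypothetical, and the stride is assumed to be positive.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical result type: which special values were seen in a vector.
struct numerical_check_result
{
    bool has_zero = false;
    bool has_nan  = false;
    bool has_inf  = false;
};

// Scan n elements of x with stride incx (assumed positive) on the host.
template <typename T>
numerical_check_result check_vector(std::size_t n, const T* x, std::size_t incx)
{
    numerical_check_result r;
    for(std::size_t i = 0; i < n; ++i)
    {
        const T v  = x[i * incx];
        r.has_zero = r.has_zero || (v == T(0));
        r.has_nan  = r.has_nan || std::isnan(v);
        r.has_inf  = r.has_inf || std::isinf(v);
    }
    return r;
}
```

Running the same scan on the inputs before a call and on the outputs after it helps tell whether a NaN or Inf was passed in or produced by the computation.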

Fixed

  • Fixed complex unit test bug caused by incorrect caxpy and zaxpy function signatures.
  • Made functions compliant with legacy BLAS for the special values alpha == 0, k == 0, beta == 1, and beta == 0.

Optimizations

  • Improved performance of single precision axpy_batched and axpy_strided_batched when batch_count >= 8192.

Known Issues

  • None

rocBLAS-2.30.0 for ROCm 3.9.0

27 Oct 20:05
91e553c

New Features

  • Slight improvements to FP16 Megatron BERT performance on MI50
  • Improvements to FP16 Transformer performance on MI50
  • Slight improvements to FP32 Transformer performance on MI50

Known Issues

  • None

rocBLAS-2.22.0 for ROCm 3.5.0

01 Jun 19:26
b2cceba

New Features

  • Added geam complex, geam_batched, and geam_strided_batched

  • Added dgmm, dgmm_batched, and dgmm_strided_batched

  • Optimized performance of the following functions:

    • ger
      • rocblas_sger, rocblas_dger
      • rocblas_sger_batched, rocblas_dger_batched
      • rocblas_sger_strided_batched, rocblas_dger_strided_batched
    • geru
      • rocblas_cgeru, rocblas_zgeru
      • rocblas_cgeru_batched, rocblas_zgeru_batched
      • rocblas_cgeru_strided_batched, rocblas_zgeru_strided_batched
    • gerc
      • rocblas_cgerc, rocblas_zgerc
      • rocblas_cgerc_batched, rocblas_zgerc_batched
      • rocblas_cgerc_strided_batched, rocblas_zgerc_strided_batched
    • symv
      • rocblas_ssymv, rocblas_dsymv, rocblas_csymv, rocblas_zsymv
      • rocblas_ssymv_batched, rocblas_dsymv_batched, rocblas_csymv_batched, rocblas_zsymv_batched
      • rocblas_ssymv_strided_batched, rocblas_dsymv_strided_batched, rocblas_csymv_strided_batched, rocblas_zsymv_strided_batched
    • sbmv
      • rocblas_ssbmv, rocblas_dsbmv
      • rocblas_ssbmv_batched, rocblas_dsbmv_batched
      • rocblas_ssbmv_strided_batched, rocblas_dsbmv_strided_batched
    • spmv
      • rocblas_sspmv, rocblas_dspmv
      • rocblas_sspmv_batched, rocblas_dspmv_batched
      • rocblas_sspmv_strided_batched, rocblas_dspmv_strided_batched
  • Improved documentation

  • Fixed argument checking in functions to match legacy BLAS

  • Fixed conjugate-transpose version of geam

Known Issues

  • None

rocBLAS-2.24.0 for ROCm 3.7.0

15 Aug 04:19

New Features

  • Improvements to User Guide and Design Document
  • L1 dot function optimized to use shuffle instructions (improvements on bf16, f16, and f32 data types)
  • Added an optimized kernel to the L1 dot function for the x dot x case
  • Standardized L1 rocblas-bench to use device pointer mode to focus on GPU memory bandwidth
  • Adjustments for hipcc (hip-clang) as the standard build compiler and CentOS 8 support
  • Added Fortran interface for all rocBLAS functions

Known Issues

  • None

rocBLAS-2.24.0 for ROCm 3.6.0

10 Jul 23:14

New Features

  • Improvements to User Guide and Design Document
  • L1 dot function optimized to use shuffle instructions (improvements on bf16, f16, and f32 data types)
  • Added an optimized kernel to the L1 dot function for the x dot x case
  • Standardized L1 rocblas-bench to use device pointer mode to focus on GPU memory bandwidth
  • Adjustments for hipcc (hip-clang) as the standard build compiler and CentOS 8 support
  • Added Fortran interface for all rocBLAS functions

Known Issues

  • None
