Releases · saadrahim/rocBLAS

10 May 23:13

saadrahim

rocm-4.2.0

1d39833

rocBLAS-2.38.0 for ROCm 4.2.0 (Latest)

Added

Added an option to the install script to build only the rocBLAS clients against a pre-built rocBLAS library
Added gemm ext support for the unpacked int8 input layout on gfx908 GPUs
- Added a new flag, rocblas_gemm_flags::rocblas_gemm_flags_pack_int8x4, to specify that the packed layout is used
  - Set rocblas_gemm_flags_pack_int8x4 when using packed int8x4. This flag should always be set on GPUs before gfx908.
  - On gfx908 GPUs, unpacked int8 is supported, so this flag does not need to be set.
  - Note that the default flags value 0 selects unpacked int8, which changes the behaviour of int8 gemm relative to ROCm 4.1.0.
Added a query function, rocblas_query_int8_layout_flag, that returns the preferred int8 layout for gemm on a given device
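
To make the packed versus unpacked distinction concrete, here is a minimal host-side sketch in Python. It assumes the packed int8x4 layout stores each group of four consecutive k-values of a row contiguously for a column-major, non-transposed A. That is an assumption, not something stated in these notes. The helper name pack_int8x4_a is hypothetical and is not part of rocBLAS.

```python
def pack_int8x4_a(a, m, k, lda):
    """Repack a column-major, non-transposed m x k int8 matrix (flat list)
    so that every 4 consecutive k-values of a row sit next to each other.

    Hypothetical illustration only; the layout is an assumption.
    """
    nb = 4  # number of int8 values per packed group
    if k % nb != 0:
        raise ValueError("k must be a multiple of 4 for the packed layout")
    if lda < m:
        raise ValueError("lda must be at least m")
    packed = [0] * (lda * k)
    for i_m in range(m):
        for i_k in range(k):
            src = i_m + i_k * lda                            # unpacked index
            dst = i_k % nb + (i_m + (i_k // nb) * lda) * nb  # packed index
            packed[dst] = a[src]
    return packed


if __name__ == "__main__":
    # 2 x 4 matrix, column-major: columns are (0,1), (2,3), (4,5), (6,7)
    print(pack_int8x4_a(list(range(8)), m=2, k=4, lda=2))
```

With unpacked int8 (flags value 0 on gfx908), no such repacking step would be needed before calling gemm.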

Optimizations

Improved performance of single precision copy, swap, and scal when incx == 1 and incy == 1.
Improved performance of single precision axpy when incx == 1, incy == 1 and batch_count <= 8192.
Improved performance of trmm.
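
As a reminder of what the incx == 1 and incy == 1 condition means, the following is a reference sketch of axpy (y := alpha * x + y) in Python. It is not the rocBLAS kernel. It only shows that unit increments make both vectors contiguous, which is the case the optimizations above target. The function name and the restriction to positive increments are assumptions made for this illustration.

```python
def axpy_reference(n, alpha, x, incx, y, incy):
    """Reference y := alpha * x + y over n elements (positive increments only)."""
    if n <= 0 or alpha == 0:
        return y  # nothing to do
    if incx == 1 and incy == 1:
        # Unit stride: both vectors are contiguous, so a kernel can use
        # wide, coalesced loads and stores.
        for i in range(n):
            y[i] += alpha * x[i]
    else:
        # General stride: elements are spread out in memory.
        for i in range(n):
            y[i * incy] += alpha * x[i * incx]
    return y


if __name__ == "__main__":
    print(axpy_reference(3, 2.0, [1.0, 2.0, 3.0], 1, [1.0, 1.0, 1.0], 1))
```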

Changed

Changed cmake_minimum_required to VERSION 3.16.8

23 Mar 01:06

saadrahim

rocm-4.1.0-test

93c8293

rocBLAS-2.36.0 for ROCm 4.1.0

Added

Added a numerical checking helper function to detect zero/NaN/Inf in the input and output vectors of rocBLAS level 1 and 2 functions.
Added a numerical checking helper function to detect zero/NaN/Inf in the input and output general matrices of rocBLAS level 2 and 3 functions.
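
The kind of check described above can be sketched on the host in a few lines of Python. This is an illustration of the idea only. The function name check_numerics and its return shape are hypothetical and do not reflect the rocBLAS helper's interface.

```python
import math


def check_numerics(values):
    """Report whether a vector contains any zero, NaN, or Inf entries."""
    return {
        "zero": any(v == 0 for v in values),
        "nan": any(math.isnan(v) for v in values),
        "inf": any(math.isinf(v) for v in values),
    }


if __name__ == "__main__":
    print(check_numerics([1.0, 0.0, float("nan")]))
```

Running such a check on both inputs and outputs helps tell apart bad input data from values produced by the computation itself.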

Fixed

Fixed a complex unit test bug caused by incorrect caxpy and zaxpy function signatures.
Made functions compliant with legacy BLAS for the special values alpha == 0, k == 0, beta == 1, and beta == 0.
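
The special values listed above follow the conventions of the reference BLAS. A minimal Python sketch of gemm (C := alpha * A * B + beta * C) shows what compliance implies: with alpha == 0 or k == 0 the product is never read, with beta == 0 the old contents of C are never read, and the combination with beta == 1 is a quick return. The function name and the list-of-lists storage are choices made for this illustration, not the rocBLAS implementation.

```python
def gemm_reference(m, n, k, alpha, a, b, beta, c):
    """Reference C := alpha * A @ B + beta * C with legacy BLAS special cases."""
    # Quick return: nothing would change.
    if m == 0 or n == 0 or ((alpha == 0 or k == 0) and beta == 1):
        return c
    for i in range(m):
        for j in range(n):
            # beta == 0: C is overwritten, not scaled, so NaN/Inf in C is dropped.
            acc = 0.0 if beta == 0 else beta * c[i][j]
            # alpha == 0 or k == 0: A and B are not read at all.
            if alpha != 0 and k != 0:
                acc += alpha * sum(a[i][p] * b[p][j] for p in range(k))
            c[i][j] = acc
    return c


if __name__ == "__main__":
    nan = float("nan")
    print(gemm_reference(1, 1, 1, 2.0, [[3.0]], [[4.0]], 0.0, [[nan]]))
```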

Optimizations

Improved performance of single precision axpy_batched and axpy_strided_batched when batch_count >= 8192.

Known Issues

None

23 Mar 00:54

saadrahim

rocm-4.1.0

93c8293

rocBLAS-2.36.0 for ROCm 4.1.0

New Features
Added

Added a numerical checking helper function to detect zero/NaN/Inf in the input and output vectors of rocBLAS level 1 and 2 functions.
Added a numerical checking helper function to detect zero/NaN/Inf in the input and output general matrices of rocBLAS level 2 and 3 functions.

Fixed

Fixed a complex unit test bug caused by incorrect caxpy and zaxpy function signatures.
Made functions compliant with legacy BLAS for the special values alpha == 0, k == 0, beta == 1, and beta == 0.

Optimizations

Improved performance of single precision axpy_batched and axpy_strided_batched when batch_count >= 8192.

Known Issues

None

27 Oct 20:05

saadrahim

rocm-3.9.0

91e553c

rocBLAS-2.30.0 for ROCm 3.9.0

New Features

Slight improvements to FP16 Megatron BERT performance on MI50
Improvements to FP16 Transformer performance on MI50
Slight improvements to FP32 Transformer performance on MI50

Known Issues

None

01 Jun 19:26

saadrahim

3.5.0

b2cceba

rocBLAS-2.22.0 for ROCm 3.5.0

New Features

Added complex geam, geam_batched, and geam_strided_batched
Added dgmm, dgmm_batched, and dgmm_strided_batched
Optimized performance
- ger
  - rocblas_sger, rocblas_dger
  - rocblas_sger_batched, rocblas_dger_batched
  - rocblas_sger_strided_batched, rocblas_dger_strided_batched
- geru
  - rocblas_cgeru, rocblas_zgeru
  - rocblas_cgeru_batched, rocblas_zgeru_batched
  - rocblas_cgeru_strided_batched, rocblas_zgeru_strided_batched
- gerc
  - rocblas_cgerc, rocblas_zgerc
  - rocblas_cgerc_batched, rocblas_zgerc_batched
  - rocblas_cgerc_strided_batched, rocblas_zgerc_strided_batched
- symv
  - rocblas_ssymv, rocblas_dsymv, rocblas_csymv, rocblas_zsymv
  - rocblas_ssymv_batched, rocblas_dsymv_batched, rocblas_csymv_batched, rocblas_zsymv_batched
  - rocblas_ssymv_strided_batched, rocblas_dsymv_strided_batched, rocblas_csymv_strided_batched, rocblas_zsymv_strided_batched
- sbmv
  - rocblas_ssbmv, rocblas_dsbmv
  - rocblas_ssbmv_batched, rocblas_dsbmv_batched
  - rocblas_ssbmv_strided_batched, rocblas_dsbmv_strided_batched
- spmv
  - rocblas_sspmv, rocblas_dspmv
  - rocblas_sspmv_batched, rocblas_dspmv_batched
  - rocblas_sspmv_strided_batched, rocblas_dspmv_strided_batched
Improved documentation
Fixed argument checking in functions to match legacy BLAS
Fixed conjugate-transpose version of geam
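
For readers unfamiliar with dgmm, one of the additions listed above, it multiplies a general matrix by a diagonal matrix given as a vector: C = A * diag(x) on the right side, or C = diag(x) * A on the left side. The Python sketch below is a reference illustration only. The function name, the string-valued side argument, and the list-of-lists storage are assumptions for this example and differ from the rocBLAS interface.

```python
def dgmm_reference(side, m, n, a, x):
    """Return C = A * diag(x) (side="right") or C = diag(x) * A (side="left")."""
    if side == "right":
        # Scales column j of A by x[j]; x needs n entries.
        return [[a[i][j] * x[j] for j in range(n)] for i in range(m)]
    if side == "left":
        # Scales row i of A by x[i]; x needs m entries.
        return [[x[i] * a[i][j] for j in range(n)] for i in range(m)]
    raise ValueError("side must be 'left' or 'right'")


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    print(dgmm_reference("right", 2, 2, a, [10, 100]))
    print(dgmm_reference("left", 2, 2, a, [10, 100]))
```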

Known Issues

None

15 Aug 04:19

saadrahim

3.7.0_1

76b1c25

rocBLAS-2.24.0 for ROCm 3.7.0

New Features

Improvements to User Guide and Design Document
Optimized the L1 dot function to use shuffle instructions (improvements for bf16, f16, and f32 data types)
Added an optimized x dot x kernel to the L1 dot function
Standardized L1 rocblas-bench to use device pointer mode, to focus on GPU memory bandwidth
Adjusted for hipcc (hip-clang) as the standard build compiler and added CentOS 8 support
Added Fortran interface for all rocBLAS functions
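
The shuffle-based dot optimization noted above relies on a tree reduction in which each lane adds the value held by a lane a fixed offset away, halving the offset each step. The Python simulation below sketches that pattern on the host. It is not the rocBLAS kernel. The lane width, function names, and the treatment of out-of-range lanes (contributing 0) are simplifying assumptions.

```python
def shuffle_down_reduce(lanes):
    """Sum per-lane values with a shuffle-down tree; lane 0 holds the result."""
    width = len(lanes)
    if width == 0 or width & (width - 1) != 0:
        raise ValueError("lane count must be a power of two")
    vals = list(lanes)
    offset = width // 2
    while offset > 0:
        # Every lane i adds the value currently held by lane i + offset.
        vals = [
            vals[i] + (vals[i + offset] if i + offset < width else 0)
            for i in range(width)
        ]
        offset //= 2
    return vals[0]


def dot_reference(x, y, width=64):
    """Accumulate per-lane partial products, then reduce across lanes."""
    lanes = [0.0] * width
    for i in range(len(x)):
        lanes[i % width] += x[i] * y[i]
    return shuffle_down_reduce(lanes)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    print(dot_reference(x, [4.0, 5.0, 6.0], width=4))
    print(dot_reference(x, x, width=4))  # x dot x: both operands are the same vector
```

In the x dot x case both operands are the same vector, so a dedicated kernel only needs to load each element once.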

Known Issues

None

15 Aug 04:17

saadrahim

3.7.0

76b1c25

rocBLAS-2.24.0 for ROCm 3.7.0

New Features

Improvements to User Guide and Design Document
Optimized the L1 dot function to use shuffle instructions (improvements for bf16, f16, and f32 data types)
Added an optimized x dot x kernel to the L1 dot function
Standardized L1 rocblas-bench to use device pointer mode, to focus on GPU memory bandwidth
Adjusted for hipcc (hip-clang) as the standard build compiler and added CentOS 8 support
Added Fortran interface for all rocBLAS functions

Known Issues

None

10 Jul 23:14

saadrahim

3.6beta6

76b1c25

rocBLAS-2.24.0 for ROCm 3.6.0

New Features

Improvements to User Guide and Design Document
Optimized the L1 dot function to use shuffle instructions (improvements for bf16, f16, and f32 data types)
Added an optimized x dot x kernel to the L1 dot function
Standardized L1 rocblas-bench to use device pointer mode, to focus on GPU memory bandwidth
Adjusted for hipcc (hip-clang) as the standard build compiler and added CentOS 8 support
Added Fortran interface for all rocBLAS functions

Known Issues

None

10 Jul 23:13

saadrahim

3.6beta5

76b1c25

rocBLAS-2.24.0 for ROCm 3.6.0

New Features

Improvements to User Guide and Design Document
Optimized the L1 dot function to use shuffle instructions (improvements for bf16, f16, and f32 data types)
Added an optimized x dot x kernel to the L1 dot function
Standardized L1 rocblas-bench to use device pointer mode, to focus on GPU memory bandwidth
Adjusted for hipcc (hip-clang) as the standard build compiler and added CentOS 8 support
Added Fortran interface for all rocBLAS functions

Known Issues

None

10 Jul 23:07

saadrahim

3.6beta4

76b1c25

rocBLAS-2.24.0 for ROCm 3.6.0

New Features

Improvements to User Guide and Design Document
Optimized the L1 dot function to use shuffle instructions (improvements for bf16, f16, and f32 data types)
Added an optimized x dot x kernel to the L1 dot function
Standardized L1 rocblas-bench to use device pointer mode, to focus on GPU memory bandwidth
Adjusted for hipcc (hip-clang) as the standard build compiler and added CentOS 8 support
Added Fortran interface for all rocBLAS functions

Known Issues

None
