[GSOC24] Addition of CUDA and GPU Acceleration to FGMRES Linear Solver in SU2 #2346
base: develop
Changes from 31 commits
@@ -0,0 +1,62 @@
/*!
 * \file GPU_lin_alg.cuh
 * \brief Declaration of the GPU Matrix Vector Product CUDA Kernel.
 *        The implementation is in <i>GPU_lin_alg.cu</i>.
 * \author A. Raj
 * \version 8.0.1 "Harrier"
 *
 * SU2 Project Website: https://su2code.github.io
 *
 * The SU2 Project is maintained by the SU2 Foundation
 * (http://su2foundation.org)
 *
 * Copyright 2012-2024, SU2 Contributors (cf. AUTHORS.md)
 *
 * SU2 is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public
 * License as published by the Free Software Foundation; either
 * version 2.1 of the License, or (at your option) any later version.
 *
 * SU2 is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public
 * License along with SU2. If not, see <http://www.gnu.org/licenses/>.
 */

#include <cuda_runtime.h>
#include "../../include/linear_algebra/CSysMatrix.hpp"
#include <iostream>
/*!
 * \brief Assert-style function that reads return codes after intercepting CUDA API calls.
 *        It reports the result code and its location if the call is unsuccessful.
 * \param[in] code - result code of the CUDA function
 * \param[in] file - name of the file holding the function
 * \param[in] line - line containing the function
 * \param[in] abort - whether to exit on failure (defaults to true)
 */
inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort) exit(code);
  }
}
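/* Usage note (editor's sketch, not part of this diff): gpuAssert is meant to be
 * called through the gpuErrChk macro defined in GPU_lin_alg.cu, e.g.
 *
 *   gpuErrChk(cudaMalloc((void**)&d_vec, sizeof(ScalarType) * vec_size));
 *
 * which prints the CUDA error string with the offending file and line, and
 * exits unless abort is set to false.
 */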
/*!
 * \brief CUDA kernel that performs the Matrix Vector Product: all threads execute product += matrix*vector
 * \param[in] matrix - matrix to be multiplied
 * \param[in] vec - vector to multiply the matrix with
 * \param[out] prod - storage for the output of the operation
 * \param[in] d_row_ptr - device array of pointers to the first non-zero element in each row of the block matrix
 * \param[in] d_col_ind - device array holding the column index of each element of the block matrix
 * \param[in] nPointDomain - number of real points of the mesh
 * \param[in] nVar - number of variables of the problem
 * \param[in] nEqn - number of equations of the problem
 */
template<typename matrixType, typename vectorType>
__global__ void GPUMatrixVectorProductAdd(matrixType* matrix, vectorType* vec, vectorType* prod, unsigned long* d_row_ptr, unsigned long* d_col_ind, unsigned long nPointDomain, unsigned long nVar, unsigned long nEqn);
@@ -0,0 +1,7 @@
NVBLAS_CPU_BLAS_LIB /path/to/libopenblas.so

NVBLAS_GPU_LIST ALL

NVBLAS_TILE_DIM 2048

NVBLAS_AUTOPIN_MEM_ENABLED
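(Editor's note, inferred from the keywords rather than stated in the PR: this appears to be a standard NVBLAS configuration. NVBLAS intercepts Level-3 BLAS calls and offloads them to the GPU, falling back to the CPU library named by NVBLAS_CPU_BLAS_LIB; NVBLAS_GPU_LIST ALL uses every visible device, NVBLAS_TILE_DIM controls the tile size used to partition offloaded operations, and NVBLAS_AUTOPIN_MEM_ENABLED enables automatic pinning of host memory for faster transfers. The library is typically activated by preloading libnvblas.so ahead of the regular BLAS at run time.)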
@@ -0,0 +1,103 @@
/*!
 * \file GPU_lin_alg.cu
 * \brief Implementation of the Matrix Vector Product CUDA Kernel
 * \author A. Raj
 * \version 8.0.1 "Harrier"
 *
 * SU2 Project Website: https://su2code.github.io
 *
 * The SU2 Project is maintained by the SU2 Foundation
 * (http://su2foundation.org)
 *
 * Copyright 2012-2024, SU2 Contributors (cf. AUTHORS.md)
 *
 * SU2 is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Lesser General Public
 * License as published by the Free Software Foundation; either
 * version 2.1 of the License, or (at your option) any later version.
 *
 * SU2 is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public
 * License along with SU2. If not, see <http://www.gnu.org/licenses/>.
 */

#include "../../include/linear_algebra/CSysMatrix.hpp"
#include "../../include/linear_algebra/GPU_lin_alg.cuh"

#ifndef gpuErrChk
#define gpuErrChk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
#endif
template<typename matrixType, typename vectorType>
__global__ void GPUMatrixVectorProductAdd(matrixType* matrix, vectorType* vec, vectorType* prod, unsigned long* d_row_ptr, unsigned long* d_col_ind, unsigned long nPointDomain, unsigned long nVar, unsigned long nEqn)
{
  /* One thread per (point, row, column) triple: i indexes the mesh point,
     j the row and k the column within the point's nVar x nEqn matrix block. */
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = threadIdx.y;
  int k = threadIdx.z;

  int prod_index = i * nVar;

  /* Zero the output entry for this (point, row) pair; the nEqn threads sharing (i, j) all write the same value. */
  if (i < nPointDomain)
  {
    prod[prod_index + j] = 0.0;
  }

  __syncthreads();

  vectorType res = 0.0;

  if (i < nPointDomain)
  {
    /* Walk the non-zero blocks of row i and accumulate this thread's partial product. */
    for (int index = d_row_ptr[i]; index < d_row_ptr[i + 1]; index++)
    {
      int matrix_index = index * nVar * nEqn;
      int vec_index = d_col_ind[index] * nEqn;

      res += matrix[matrix_index + (j * nEqn + k)] * vec[vec_index + k];
    }

    /* The nEqn threads of row j combine their partial sums into the same output entry. */
    atomicAdd(&prod[prod_index + j], res);
  }
}
template<class ScalarType>
void CSysMatrix<ScalarType>::GPUMatrixVectorProduct(const CSysVector<ScalarType>& vec, CSysVector<ScalarType>& prod,
                                                    CGeometry* geometry, const CConfig* config) const
{
  ScalarType* d_vec;
  ScalarType* d_prod;

  unsigned long mat_size = nnz * nVar * nEqn;
  unsigned long vec_size = nPointDomain * nVar;

  gpuErrChk(cudaMalloc((void**)(&d_vec), (sizeof(ScalarType) * vec_size)));
  gpuErrChk(cudaMalloc((void**)(&d_prod), (sizeof(ScalarType) * vec_size)));

  gpuErrChk(cudaMemcpy((void*)(d_matrix), (void*)&matrix[0], (sizeof(ScalarType) * mat_size), cudaMemcpyHostToDevice));
  gpuErrChk(cudaMemcpy((void*)(d_vec), (void*)&vec[0], (sizeof(ScalarType) * vec_size), cudaMemcpyHostToDevice));
  gpuErrChk(cudaMemcpy((void*)(d_prod), (void*)&prod[0], (sizeof(ScalarType) * vec_size), cudaMemcpyHostToDevice));
Reviewer: You don't need to copy the product, you just need to memset to 0.

Author: Good catch, will add this. Thank you.
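A minimal sketch of that suggestion, reusing the variable names from this diff (not code from the PR): the third host-to-device copy above can be replaced by zero-filling d_prod on the device,

  gpuErrChk(cudaMemset((void*)d_prod, 0, sizeof(ScalarType) * vec_size));  // zero-initialize instead of copying prod

which would also make the zeroing pass at the top of the kernel redundant.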
  /* Launch configuration: blockDim.x mesh points per block, with the y and z
     thread dimensions spanning the nVar x nEqn entries of each matrix block. */
  double xDim = (double)1024.0 / (nVar * nEqn);
  dim3 blockDim(floor(xDim), nVar, nEqn);
  double gridx = (double)nPointDomain / xDim;
  dim3 gridDim(ceil(gridx), 1, 1);
Comment on lines +88 to +92

Reviewer: Can you document the choice of work distribution between blocks and threads?
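One possible reading of the launch configuration above (an editor's gloss, not the author's documentation): the block is sized so that blockDim.x * nVar * nEqn stays within CUDA's 1024-threads-per-block limit, giving one thread per (point, row, column) triple. For example, assuming nVar = nEqn = 5:

  double xDim = 1024.0 / 25;                      // 40.96
  dim3 blockDim(floor(xDim), 5, 5);               // (40, 5, 5) -> 1000 threads per block
  dim3 gridDim(ceil(nPointDomain / xDim), 1, 1);  // blocks to cover all points

Each block then covers 40 mesh points. Note that the grid is sized with the unfloored xDim (40.96 here) while each block actually covers floor(xDim) = 40 points, so for some values of nPointDomain the last few points may not be assigned a thread; computing the grid size from blockDim.x instead would avoid that.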
  GPUMatrixVectorProductAdd<<<gridDim, blockDim>>>(d_matrix, d_vec, d_prod, d_row_ptr, d_col_ind, nPointDomain, nVar, nEqn);
  gpuErrChk(cudaPeekAtLastError());

  gpuErrChk(cudaMemcpy((void*)(&prod[0]), (void*)d_prod, (sizeof(ScalarType) * vec_size), cudaMemcpyDeviceToHost));

  gpuErrChk(cudaFree(d_vec));
  gpuErrChk(cudaFree(d_prod));
}

template class CSysMatrix<su2mixedfloat>;
Reviewer: Is this based on some publication? Did you experiment with other divisions of work? For example, I see you are going for coalesced access to the matrix blocks, but this requires multiple reads of the same vector entries.

Author: I haven't experimented with different implementations; I went with this one because it seemed optimal. It does access the same vector elements repeatedly while going through an entire row. I haven't looked into publications yet, but you're right, there may be a better way to do this, and I'll look into them. If you have recommendations for improvements, let me know.
Author: Also, the current bottleneck is the memory copy between the CPU and GPU; the kernel launch itself is 20 times faster than the repeated copies. Any insights on that would also be appreciated.
Our current approach to circumvent this is to port not just single subroutines like the matrix-vector multiplication, but the entire Krylov solver loop that searches over all the search directions in the subspace, to the GPU. This would cut down on the repeated memory transfers.
Reviewer: You can do something like a flag in CSysMatrix that is set to true when the matrix is uploaded to the GPU and set to false when the matrix changes (for example, when we clear the matrix to write new blocks).
You can use pinned host memory to make the transfers faster.
You can try uploading the matrix in chunks and overlapping the uploads with the CPU work of filling another chunk.
Ultimately, the issue of transferring the matrix only goes away by porting the entire code 😅
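A rough sketch of the first two suggestions (hedged; d_matrix_is_current and h_buf are hypothetical names, not part of this PR or of CSysMatrix):

  // Upload-once flag: skip the matrix transfer while the device copy is current.
  if (!d_matrix_is_current) {  // set to false wherever the matrix is rewritten
    gpuErrChk(cudaMemcpy((void*)d_matrix, (void*)&matrix[0],
                         sizeof(ScalarType) * mat_size, cudaMemcpyHostToDevice));
    d_matrix_is_current = true;
  }

  // Pinned (page-locked) host buffer: allows faster DMA transfers to the device.
  ScalarType* h_buf = nullptr;
  gpuErrChk(cudaMallocHost((void**)&h_buf, sizeof(ScalarType) * vec_size));
  // ... fill h_buf, copy with cudaMemcpy(d_vec, h_buf, ...), free with cudaFreeHost(h_buf).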
Reviewer: Regarding recommendations, I have to be a bit cryptic because of my current job, but the general goals are coalesced access and avoiding reading or writing the same global memory location more than once.
I read this paper before my current job: "Optimization of Block Sparse Matrix-Vector Multiplication on Shared-Memory Parallel Architectures".
But like you said, there is more to gain by porting more linear algebra operations than by micro-optimizing the multiplications.
Author: I'll go through the paper and follow up on the pinned memory lead.
I'll continue working on porting the solver; let's hope we get some interesting work done in time for the conference 😄
If necessary, I'll contact you with updates either on this thread or catch you at the next dev meeting. Thanks for the help, Pedro.