[WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586
base: 2023.06-software.eessi.io
Conversation
…-layer into 2023.06-software.eessi.io-cuDNN-8.9.2.26-system
- `EESSI-install-software.sh`: use `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh` with `scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
- `create_lmodsitepackage.py`: consolidate the `eessi_{cuda,cudnn}_enabled_load_hook` functions into a single one (`eessi_cuda_and_libraries_enabled_load_hook`); the remaining hook is prepared to easily add new modules, e.g., cuTENSOR
- `eb_hooks.py`: move the code that iterates over all files, replacing non-distributable ones with symlinks into `host_injections`, into a common function (`replace_non_distributable_files_with_symlinks`)
- `install_scripts.sh`: add files to copy to CVMFS (see `nvidia_files`)
- `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`: improved creation of the tmp directory
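The consolidation in `eb_hooks.py` could look roughly like the sketch below. The function name `replace_non_distributable_files_with_symlinks` comes from the PR description; the pattern-matching logic and the `host_injections` layout shown here are illustrative assumptions, not the actual implementation:

```python
import fnmatch
import os


def replace_non_distributable_files_with_symlinks(install_dir, host_injections_dir, patterns):
    """Sketch: walk an installation directory and replace every file whose
    name matches one of the given glob patterns with a symlink pointing to
    the corresponding path under host_injections (where the site installs
    the non-distributable files, e.g. CUDA or cuDNN libraries)."""
    for root, _dirs, files in os.walk(install_dir):
        for fname in files:
            if not any(fnmatch.fnmatch(fname, pattern) for pattern in patterns):
                continue
            full_path = os.path.join(root, fname)
            # preserve the relative layout below install_dir in the link target
            rel_path = os.path.relpath(full_path, install_dir)
            target = os.path.join(host_injections_dir, rel_path)
            os.remove(full_path)
            os.symlink(target, full_path)
```

Factoring this into one function means CUDA, cuDNN, and any future library (such as cuTENSOR) only need to supply their own file patterns and injection prefix.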
Bot build attempts (each triggered with `bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2`):

- We run a first attempt without doing any modifications (e.g., to work around issues)...
- Building after applied changes provided by #579...
- Trying again...
- Cleaned up code for creating/updating Lmod cfg files (
- Commented out code (in
- Added
- Try again...
- Now, try building for multiple compute capabilities (
Currently not actively being worked on, because we need to rework/implement support for building for GPUs, which also depends on support for dev.eessi.io.
WORK IN PROGRESS
Eventually, this is aimed at adding PyTorch/2.1.2 with CUDA/12.1.1. However, building it may not work out of the box, so this is for documenting the progress, issues we hit and workarounds applied.
PyTorch with CUDA requires cuDNN, hence this PR also builds cuDNN, using the changes provided by #581 and #579. However, the changes from the latter would have to be ingested first, so we need additional changes here; we try to document well what we do, and why.
Initially, we only build for compute capability 7.0; later, we build for architectures from Pascal onwards, but excluding architectures for embedded GPUs and very special compute capabilities such as 9.0a. That is, the list of compute capabilities would be 6.0, 6.1, 7.0, 7.5, 8.0, 8.6, 8.9, 9.0.
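For reference, a compute capability list like the one above ultimately maps to per-architecture `nvcc` `-gencode` flags when CUDA code is compiled. The helper below is only an illustration of that mapping (the PR itself configures capabilities through the build tooling rather than invoking `nvcc` directly):

```python
def nvcc_gencode_flags(compute_capabilities):
    """Map compute capabilities such as '8.6' to nvcc -gencode flags,
    e.g. '-gencode arch=compute_86,code=sm_86'."""
    flags = []
    for cc in compute_capabilities:
        digits = cc.replace(".", "")  # '8.6' -> '86'
        flags.append(f"-gencode arch=compute_{digits},code=sm_{digits}")
    return flags


# the capability list discussed above
capabilities = ["6.0", "6.1", "7.0", "7.5", "8.0", "8.6", "8.9", "9.0"]
print(" ".join(nvcc_gencode_flags(capabilities)))
```

Each additional capability grows the fat binary and the build time, which is one reason to start with a single capability (7.0) before expanding to the full list.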