{2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #369

trz42 (Collaborator) commented May 23, 2024

Add PyTorch/2.1.2-foss-2023a-CUDA-12.1.1 to NESSI.

SPDX license identifier: BSD-style

Missing packages:

2 out of 107 required modules missing:

* magma/2.7.2-foss-2023a-CUDA-12.1.1 (magma-2.7.2-foss-2023a-CUDA-12.1.1.eb)
* PyTorch/2.1.2-foss-2023a-CUDA-12.1.1 (PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb)

nessi-bot bot commented May 23, 2024

Instance AWS-MC-NESSI is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/skylake_avx512, x86_64/amd/zen2, aarch64/generic
  • repositories: nessi-2023.06-swl-deb11, nessi-2023.06-cl, nessi-2023.06-swl-deb10

nessi-bot bot commented May 23, 2024

Instance eX3-NESSI is configured to build for:

  • architectures: x86_64/amd/zen2, aarch64/generic
  • repositories: nessi-2023.06-cl, nessi-2023.06-swl-deb11, nessi-2023.06-swl-deb10

nessi-bot bot commented May 23, 2024

Instance Fram-NESSI is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/broadwell
  • repositories: nessi-2023.06-swl-deb10, nessi-2023.06-cl, nessi-2023.06-swl-deb11

nessi-bot bot commented May 23, 2024

Instance Saga-NESSI is configured to build for:

  • architectures: x86_64/intel/skylake_avx512, x86_64/intel/broadwell, x86_64/generic
  • repositories: nessi-2023.06-cl, nessi-2023.06-swl-deb11, nessi-2023.06-swl-deb10

trz42 commented May 23, 2024

Just a first test on a single architecture...

bot: build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2

nessi-bot bot commented May 23, 2024

Updates by the bot instance AWS-MC-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2 from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2 resulted in:

nessi-bot bot commented May 23, 2024

Updates by the bot instance eX3-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2 from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2 resulted in:

    • no jobs were submitted
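
The bot replies above show each short key of the `bot: build` command next to its expanded form. As a rough illustration of that mapping (inferred only from the replies in this thread, not taken from the bot's source), a hypothetical helper `expand_bot_command` could look like this:

```shell
#!/bin/sh
# Sketch only: expand_bot_command is a hypothetical helper, not part of the bot.
# It rewrites the short keys seen in this thread (inst:, repo:, arch:) into the
# expanded keys the bot echoes back (instance:, repository:, architecture:).
# The function body runs in a subshell so its variables do not leak.
expand_bot_command() (
    out=""
    for word in "$@"; do
        case "$word" in
            inst:*) word="instance:${word#inst:}" ;;
            repo:*) word="repository:${word#repo:}" ;;
            arch:*) word="architecture:${word#arch:}" ;;
        esac
        # Join words with single spaces, without a leading space.
        out="${out:+$out }$word"
    done
    printf '%s\n' "$out"
)

expand_bot_command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2
# prints: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
```

Only the instance named in `inst:` submits a job; the other instances still expand the command and then report that no jobs were submitted.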

nessi-bot bot commented May 23, 2024

Updates by the bot instance Fram-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2 from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2 resulted in:

    • no jobs were submitted

nessi-bot bot commented May 23, 2024

Updates by the bot instance Saga-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2 from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2 resulted in:

    • no jobs were submitted

nessi-bot bot commented May 23, 2024

New job on instance AWS-MC-NESSI for architecture x86_64-amd-zen2 for repository nessi-2023.06-swl-deb11 in job dir /project/def-nessi/SHARED/jobs/2024.05/pr_369/11277

date job status comment
May 23 07:18:57 UTC 2024 submitted job id 11277 awaits release by job manager
May 23 07:19:31 UTC 2024 released job awaits launch by Slurm scheduler
May 23 07:23:33 UTC 2024 running job 11277 is running
May 23 07:55:24 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-11277.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
May 23 07:55:24 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 7/7 test case(s) from 7 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11277.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
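
The build report above lists five pattern checks against the Slurm output file. A minimal sketch of such a check, assuming plain fixed-string matching and a hypothetical helper `check_build_log` (how the bot itself evaluates the log is not shown in this thread), might be:

```shell
#!/bin/sh
# Sketch only: check_build_log is a hypothetical helper, not the bot's code.
# It applies the five pattern checks listed in the job report to a log file
# and prints SUCCESS or FAILURE on its last output line.
check_build_log() (
    log="$1"
    status="SUCCESS"
    # Patterns that must NOT appear in the log of a successful build.
    for pattern in "ERROR:" "FAILED:" "required modules missing:"; do
        if grep -qF -- "$pattern" "$log"; then
            echo "bad: found message matching ${pattern}"
            status="FAILURE"
        fi
    done
    # Patterns that MUST appear in the log of a successful build.
    for pattern in "No missing installations" ".tar.gz created!"; do
        if ! grep -qF -- "$pattern" "$log"; then
            echo "bad: no message matching ${pattern}"
            status="FAILURE"
        fi
    done
    echo "$status"
)

# Example with a throwaway log file (contents are made up for illustration).
demo_log="$(mktemp)"
printf '%s\n' "No missing installations" "/tmp/example.tar.gz created!" > "$demo_log"
check_build_log "$demo_log"
rm -f "$demo_log"
```

Run against the real `slurm-11277.out`, a check of this kind would flag the first three patterns and the missing `No missing installations` message, matching the FAILURE reported above.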

…into nessi-2023.06-PyTorch-2.1.2-2023a-CUDA-12.1.1

trz42 commented May 23, 2024

Next try after #370 (fix for GPU check)...

bot: build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2

nessi-bot bot commented May 23, 2024

Updates by the bot instance AWS-MC-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2 from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2 resulted in:

nessi-bot bot commented May 23, 2024

Updates by the bot instance Saga-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2 from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2 resulted in:

    • no jobs were submitted

nessi-bot bot commented May 23, 2024

Updates by the bot instance Fram-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2 from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2 resulted in:

    • no jobs were submitted

nessi-bot bot commented May 23, 2024

Updates by the bot instance eX3-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2 from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2 resulted in:

    • no jobs were submitted

nessi-bot bot commented May 23, 2024

New job on instance AWS-MC-NESSI for architecture x86_64-amd-zen2 for repository nessi-2023.06-swl-deb11 in job dir /project/def-nessi/SHARED/jobs/2024.05/pr_369/11286

date job status comment
May 23 09:32:53 UTC 2024 submitted job id 11286 awaits release by job manager
May 23 09:33:37 UTC 2024 released job awaits launch by Slurm scheduler
May 23 09:34:41 UTC 2024 running job 11286 is running
May 23 10:47:02 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-11286.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716460007.tar.gz
size: 302 MiB (316723710 bytes)
entries: 113
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
magma/2.7.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
no other files in tarball
May 23 10:47:02 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 7/7 test case(s) from 7 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11286.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

trz42 commented May 23, 2024

Next try after putting source file under shared source path...

bot: build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2

nessi-bot bot commented May 23, 2024

Updates by the bot instance AWS-MC-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2 from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2 resulted in:

nessi-bot bot commented May 23, 2024

Updates by the bot instance eX3-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2 from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2 resulted in:

    • no jobs were submitted

nessi-bot bot commented May 23, 2024

Updates by the bot instance Fram-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2 from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2 resulted in:

    • no jobs were submitted

nessi-bot bot commented May 23, 2024

Updates by the bot instance Saga-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:zen2 from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:zen2 resulted in:

    • no jobs were submitted

nessi-bot bot commented May 23, 2024

New job on instance AWS-MC-NESSI for architecture x86_64-amd-zen2 for repository nessi-2023.06-swl-deb11 in job dir /project/def-nessi/SHARED/jobs/2024.05/pr_369/11287

date job status comment
May 23 11:03:10 UTC 2024 submitted job id 11287 awaits release by job manager
May 23 11:04:07 UTC 2024 released job awaits launch by Slurm scheduler
May 23 11:05:09 UTC 2024 running job 11287 is running
May 23 12:20:22 UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-11287.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1716465519.tar.gz
size: 302 MiB (316723593 bytes)
entries: 113
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/software
magma/2.7.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2
no other files in tarball
May 23 12:20:22 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 7/7 test case(s) from 7 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11287.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

nessi-bot bot commented May 27, 2024

Updates by the bot instance eX3-NESSI (click for details)
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic
  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic resulted in:

nessi-bot bot commented May 27, 2024

New job on instance eX3-NESSI for architecture aarch64-generic for repository nessi-2023.06-swl-deb10 in job dir /home/thomarob/pilot.nessi.no/jobs/2024.05/pr_369/213246

date job status comment
May 27 08:50:54 PM UTC 2024 submitted job id 213246 awaits release by job manager
May 27 08:51:41 PM UTC 2024 released job awaits launch by Slurm scheduler
May 27 08:52:51 PM UTC 2024 running job 213246 is running
May 31 08:54:17 PM UTC 2024 finished
🤷 UNKNOWN (click triangle for details)
  • Job results file _bot_job213246.result does not exist in job directory, or parsing it failed.
  • No artefacts were found/reported.
May 31 08:54:17 PM UTC 2024 test result
🤷 UNKNOWN (click triangle for details)
  • Job test file _bot_job213246.test does not exist in job directory, or parsing it failed.

…into nessi-2023.06-PyTorch-2.1.2-2023a-CUDA-12.1.1

trz42 commented May 29, 2024

Try job on AWS and with different container (on eX3)...

bot: build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:aarch64/generic
bot: build inst:eX3-NESSI repo:nessi-2023.06-swl-deb11 arch:aarch64/generic

nessi-bot bot commented May 29, 2024

Updates by the bot instance AWS-MC-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:aarch64/generic from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb11 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic resulted in:

  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic resulted in:

    • no jobs were submitted

nessi-bot bot commented May 29, 2024

Updates by the bot instance eX3-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:aarch64/generic from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb11 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic resulted in:

    • no jobs were submitted
  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic resulted in:

nessi-bot bot commented May 29, 2024

Updates by the bot instance Fram-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:aarch64/generic from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb11 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic resulted in:

    • no jobs were submitted
  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic resulted in:

    • no jobs were submitted

nessi-bot bot commented May 29, 2024

Updates by the bot instance Saga-NESSI (click for details)
  • received bot command build inst:AWS-MC-NESSI repo:nessi-2023.06-swl-deb11 arch:aarch64/generic from trz42

    • expanded format: build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb11 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic
  • handling command build instance:AWS-MC-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic resulted in:

    • no jobs were submitted
  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb11 architecture:aarch64/generic resulted in:

    • no jobs were submitted

nessi-bot bot commented May 29, 2024

New job on instance AWS-MC-NESSI for architecture aarch64-generic for repository nessi-2023.06-swl-deb11 in job dir /project/def-nessi/SHARED/jobs/2024.05/pr_369/11713

date job status comment
May 29 16:20:39 UTC 2024 submitted job id 11713 awaits release by job manager
May 29 16:20:50 UTC 2024 released job awaits launch by Slurm scheduler
May 29 16:26:54 UTC 2024 running job 11713 is running
May 30 02:49:00 UTC 2024 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-11713.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-generic-1717036716.tar.gz
size: 900 MiB (944297224 bytes)
entries: 12844
modules under 2023.06/software/linux/aarch64/generic/modules/all
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/aarch64/generic/software
magma/2.7.2-foss-2023a-CUDA-12.1.1
PyTorch/2.1.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/aarch64/generic
2023.06/init/easybuild/eb_hooks.py
May 30 02:49:00 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-11713.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
May 30 07:39:18 UTC 2024 uploaded transfer of eessi-2023.06-software-linux-aarch64-generic-1717036716.tar.gz to S3 bucket succeeded
May 30 08:54:02 AM UTC 2024 staged tarball eessi-2023.06-software-linux-aarch64-generic-1717036716.tar.gz downloaded to Stratum-0
May 30 08:59:15 AM UTC 2024 pr_opened merge PR https://github.com/NorESSI/2024-staging/pull/427 to approve ingest
May 30 09:18:25 AM UTC 2024 approved 👍 tarball eessi-2023.06-software-linux-aarch64-generic-1717036716.tar.gz approved, see PR https://github.com/NorESSI/2024-staging/pull/427
May 30 09:27:32 AM UTC 2024 ingested 🎉 tarball eessi-2023.06-software-linux-aarch64-generic-1717036716.tar.gz successfully ingested at 2023.06/

nessi-bot bot commented May 29, 2024

New job on instance eX3-NESSI for architecture aarch64-generic for repository nessi-2023.06-swl-deb11 in job dir /home/thomarob/pilot.nessi.no/jobs/2024.05/pr_369/215254

date job status comment
May 29 04:20:40 PM UTC 2024 submitted job id 215254 awaits release by job manager
May 29 04:20:54 PM UTC 2024 released job awaits launch by Slurm scheduler
May 29 04:22:02 PM UTC 2024 running job 215254 is running
May 29 04:56:54 PM UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-215254.out
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
No artefacts were created or found.
May 29 04:56:54 PM UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-215254.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

trz42 commented May 29, 2024

Try again deb10 container

bot: build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic

nessi-bot bot commented May 29, 2024

Updates by the bot instance AWS-MC-NESSI (click for details)
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic
  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic resulted in:

    • no jobs were submitted

nessi-bot bot commented May 29, 2024

Updates by the bot instance Fram-NESSI (click for details)
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic
  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic resulted in:

    • no jobs were submitted

nessi-bot bot commented May 29, 2024

Updates by the bot instance eX3-NESSI (click for details)
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic
  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic resulted in:

nessi-bot bot commented May 29, 2024

Updates by the bot instance Saga-NESSI (click for details)
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic
  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic resulted in:

    • no jobs were submitted

nessi-bot bot commented May 29, 2024

New job on instance eX3-NESSI for architecture aarch64-generic for repository nessi-2023.06-swl-deb10 in job dir /home/thomarob/pilot.nessi.no/jobs/2024.05/pr_369/215260

date job status comment
May 29 04:54:11 PM UTC 2024 submitted job id 215260 awaits release by job manager
May 29 04:54:35 PM UTC 2024 released job awaits launch by Slurm scheduler
May 29 04:55:48 PM UTC 2024 running job 215260 is running
Jun 02 04:57:52 PM UTC 2024 finished
🤷 UNKNOWN (click triangle for details)
  • Job results file _bot_job215260.result does not exist in job directory, or parsing it failed.
  • No artefacts were found/reported.
Jun 02 04:57:52 PM UTC 2024 test result
🤷 UNKNOWN (click triangle for details)
  • Job test file _bot_job215260.test does not exist in job directory, or parsing it failed.

trz42 commented May 29, 2024

Try running tests sequentially...

bot: build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic

nessi-bot bot commented May 29, 2024

Updates by the bot instance AWS-MC-NESSI (click for details)
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic
  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic resulted in:

    • no jobs were submitted

nessi-bot bot commented May 29, 2024

Updates by the bot instance eX3-NESSI (click for details)
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic
  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic resulted in:

nessi-bot bot commented May 29, 2024

Updates by the bot instance Fram-NESSI (click for details)
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic
  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic resulted in:

    • no jobs were submitted

nessi-bot bot commented May 29, 2024

Updates by the bot instance Saga-NESSI (click for details)
  • received bot command build inst:eX3-NESSI repo:nessi-2023.06-swl-deb10 arch:aarch64/generic from trz42

    • expanded format: build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic
  • handling command build instance:eX3-NESSI repository:nessi-2023.06-swl-deb10 architecture:aarch64/generic resulted in:

    • no jobs were submitted

nessi-bot bot commented May 29, 2024

New job on instance eX3-NESSI for architecture aarch64-generic for repository nessi-2023.06-swl-deb10 in job dir /home/thomarob/pilot.nessi.no/jobs/2024.05/pr_369/215297

date job status comment
May 29 09:12:53 PM UTC 2024 submitted job id 215297 awaits release by job manager
May 29 09:13:15 PM UTC 2024 released job awaits launch by Slurm scheduler
May 29 09:14:29 PM UTC 2024 running job 215297 is running
May 31 08:05:56 AM UTC 2024 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-215297.out
❌ found message matching ERROR:
❌ found message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-aarch64-generic-1717141918.tar.gz
size: 301 MiB (316566501 bytes)
entries: 117
modules under 2023.06/software/linux/aarch64/generic/modules/all
magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/aarch64/generic/software
magma/2.7.2-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/aarch64/generic
2023.06/init/easybuild/eb_hooks.py
May 31 08:05:56 AM UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-215297.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

trz42 commented May 30, 2024

Checklist before starting deployment (setting bot:deploy label):

  • Check if the SPDX license identifier is provided
  • Check whether builds for all required architectures succeed (SUCCESS message + reasonably sized tarball)
    • lots of problems building for aarch64/generic on eX3, hence using a build from AWS
    • anyhow these builds should be considered experimental and will be replaced in the future
  • Check if the PR is up-to-date with the target branch nessi.no-2023.06 in the repository (if not, what are the differences?)
  • Assess if all requested changes are sound (checking files changed on GitHub.com)
    • it includes a change to eb_hooks.py to run tests sequentially
    • we should comment that out before merging, because it wasn't needed when building the package on AWS
  • Verify that all easyconfigs being built are included with the EB version used (if not, why not?)
  • Review changes (if any) needed to get the build(s) to succeed (common changes for all architectures, changes for a single architecture, changes because of build environment specifics, etc.)

trz42 commented May 30, 2024

We need to be very careful about which staging PRs are approved and which are rejected. Also, any tarball we ingest may not include the updated eb_hooks.py (running PyTorch tests sequentially on aarch64), hence subsequent PRs may/will include eb_hooks.py, which could come as a surprise.

Target architectures

  • x86_64-intel-broadwell: eessi-2023.06-software-linux-x86_64-intel-broadwell-1716873120.tar.gz
  • x86_64-intel-skylake_avx512: eessi-2023.06-software-linux-x86_64-intel-skylake_avx512-1716771995.tar.gz
  • x86_64-amd-zen2: eessi-2023.06-software-linux-x86_64-amd-zen2-1716794817.tar.gz
  • x86_64-generic: eessi-2023.06-software-linux-x86_64-generic-1716836288.tar.gz
  • aarch64-generic: eessi-2023.06-software-linux-aarch64-generic-1717036716.tar.gz

Checklist for deployment/ingestion

  • at least one tarball for each architecture has been uploaded
  • at least one tarball for each architecture has been staged
  • for each staged tarball a PR for approval/rejection was opened
  • for each architecture one and only one tarball has been approved
  • for each architecture one and only one tarball was ingested
  • all tests in the CI have succeeded (rerun failed tests ~ 10 mins after last ingest)
  • the package has become available via CernVM-FS for all ingested architectures (for Lmod cache updates, pay attention to the timestamps of files)
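
The "one and only one tarball per architecture" items can be checked mechanically from the tarball names listed under Target architectures. The sketch below assumes names of the form `eessi-<version>-software-linux-<arch>-<timestamp>.tar.gz`, as seen in this thread; `tarball_arch` and `duplicate_archs` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and rely on the naming scheme
# eessi-<version>-software-linux-<arch>-<timestamp>.tar.gz seen in this PR.

# Print the architecture part of a tarball name.
tarball_arch() (
    name="${1##*/}"                   # drop any leading directory
    name="${name%.tar.gz}"            # drop the extension
    name="${name#*-software-linux-}"  # drop everything up to the OS part
    printf '%s\n' "${name%-*}"        # drop the trailing timestamp
)

# Print every architecture that occurs more than once in the given list.
duplicate_archs() (
    for tarball in "$@"; do
        tarball_arch "$tarball"
    done | sort | uniq -d
)

tarball_arch eessi-2023.06-software-linux-x86_64-amd-zen2-1716794817.tar.gz
# prints: x86_64-amd-zen2
```

An empty result from `duplicate_archs` for the list of approved tarballs means no architecture was approved twice.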
command & log

command

BASE_DIR=/cvmfs/pilot.nessi.no/versions/2023.06/software/linux \
ARCHS=() \
ARCHS+=("aarch64/generic") ; \
ARCHS+=("x86_64/generic") ; \
ARCHS+=("x86_64/amd/zen2") ; \
ARCHS+=("x86_64/intel/broadwell") ; \
ARCHS+=("x86_64/intel/skylake_avx512") ; \
for arch in "${ARCHS[@]}"; do \
ls -l \
${BASE_DIR}/${arch}/{software,modules/all}/magma/2.7.2-foss-2023a-CUDA-12.1.1* \
${BASE_DIR}/${arch}/{software,modules/all}/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1* ; \
done ; \
ls -l \
${BASE_DIR}/../../init/easybuild/eb_hooks.py
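
The `ls -l` loop above reports missing installations as `ls` errors. As an alternative sketch, a hypothetical helper `report_install` could print one `present`/`missing` line per location instead; the directory layout is the one used in the command above, and everything else is an assumption made for illustration.

```shell
#!/bin/sh
# Sketch only: report_install is a hypothetical helper. For one architecture
# and one package prefix it prints "present" or "missing" for both the
# software tree and the modules tree, using the same glob as the ls command.
report_install() (
    base="$1"
    arch="$2"
    pkg="$3"
    for kind in software modules/all; do
        # Expand the glob; if nothing matches, $1 stays the literal pattern.
        set -- "${base}/${arch}/${kind}/${pkg}"*
        if [ -e "$1" ]; then
            echo "present ${arch} ${kind} ${pkg}"
        else
            echo "missing ${arch} ${kind} ${pkg}"
        fi
    done
)

# Example call, using the base directory from the command above.
report_install /cvmfs/pilot.nessi.no/versions/2023.06/software/linux \
    aarch64/generic magma/2.7.2-foss-2023a-CUDA-12.1.1
```

Comparing the output before and after ingestion then reduces to diffing two short lists.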

log - BEFORE ingestion

ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/software/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/software/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/software/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/software/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
ls: cannot access '/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1*': No such file or directory
-rw-rw-r-- 1 cvmfs cvmfs 43825 May 22 23:12 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/../../init/easybuild/eb_hooks.py 

log - AFTER ingestion

-rw-rw-r-- 1 cvmfs cvmfs 1425 May 29 19:05 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
-rw-rw-r-- 1 cvmfs cvmfs 3238 May 30 04:34 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua

/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/software/magma/2.7.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 29 19:05 easybuild
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 29 19:05 include
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 29 19:05 lib
lrwxrwxrwx 1 cvmfs cvmfs    3 May 29 19:05 lib64 -> lib

/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/aarch64/generic/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 30 04:34 bin
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 30 04:35 easybuild
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 30 04:32 lib
lrwxrwxrwx 1 cvmfs cvmfs    3 May 30 04:34 lib64 -> lib
-rw-rw-r-- 1 cvmfs cvmfs 1424 May 27 11:29 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
-rw-rw-r-- 1 cvmfs cvmfs 3237 May 27 20:54 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua

/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/software/magma/2.7.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 11:29 easybuild
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 27 11:29 include
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 11:29 lib
lrwxrwxrwx 1 cvmfs cvmfs    3 May 27 11:29 lib64 -> lib

/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/generic/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 27 20:54 bin
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 20:55 easybuild
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 20:52 lib
lrwxrwxrwx 1 cvmfs cvmfs    3 May 27 20:54 lib64 -> lib
-rw-rw-r-- 1 cvmfs cvmfs 1425 May 26 23:06 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
-rw-rw-r-- 1 cvmfs cvmfs 3238 May 27 09:23 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua

/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/software/magma/2.7.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 26 23:07 easybuild
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 26 23:06 include
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 26 23:06 lib
lrwxrwxrwx 1 cvmfs cvmfs    3 May 26 23:06 lib64 -> lib

/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/amd/zen2/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 27 09:22 bin
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 09:24 easybuild
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 09:20 lib
lrwxrwxrwx 1 cvmfs cvmfs    3 May 27 09:22 lib64 -> lib
-rw-rw-r-- 1 cvmfs cvmfs 1432 May 27 11:14 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
-rw-rw-r-- 1 cvmfs cvmfs 3245 May 28 07:06 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua

/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/software/magma/2.7.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 11:14 easybuild
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 27 11:14 include
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 11:14 lib
lrwxrwxrwx 1 cvmfs cvmfs    3 May 27 11:14 lib64 -> lib

/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/broadwell/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 28 07:06 bin
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 28 07:08 easybuild
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 28 07:02 lib
lrwxrwxrwx 1 cvmfs cvmfs    3 May 28 07:06 lib64 -> lib
-rw-rw-r-- 1 cvmfs cvmfs 1437 May 26 09:03 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/modules/all/magma/2.7.2-foss-2023a-CUDA-12.1.1.lua
-rw-rw-r-- 1 cvmfs cvmfs 3250 May 27 03:02 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/modules/all/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1.lua

/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/magma/2.7.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 26 09:03 easybuild
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 26 09:02 include
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 26 09:02 lib
lrwxrwxrwx 1 cvmfs cvmfs    3 May 26 09:02 lib64 -> lib

/cvmfs/pilot.nessi.no/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1:
total 14
dr-xr-xr-x 2 cvmfs cvmfs 4096 May 27 03:01 bin
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 03:03 easybuild
dr-xr-xr-x 3 cvmfs cvmfs 4096 May 27 01:03 lib
lrwxrwxrwx 1 cvmfs cvmfs    3 May 27 03:01 lib64 -> lib
-rw-rw-r-- 1 cvmfs cvmfs 45378 May 26 08:15 /cvmfs/pilot.nessi.no/versions/2023.06/software/linux/../../init/easybuild/eb_hooks.py 
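
For future ingestions, the before/after comparison above could be condensed into one status line per architecture instead of raw `ls` output. The following is only a sketch: `check_ingested` is a hypothetical helper, not part of the NESSI or EESSI tooling. It assumes the layout visible in the logs, with a `software/<name>/<version>` directory and a `modules/all/<name>/<version>.lua` file under each architecture prefix.

```shell
# Hypothetical helper (not part of any NESSI/EESSI tooling).
# Usage: check_ingested BASE_DIR ARCH NAME VERSION
# Prints one status line and returns 0 only if both the installation
# directory and the Lua module file exist.
# The body runs in a subshell, so its variables do not leak to the caller.
check_ingested() (
    base_dir=$1
    arch=$2
    name=$3
    version=$4

    # Paths follow the layout seen in the listing above (assumed, not verified).
    sw_dir="${base_dir}/${arch}/software/${name}/${version}"
    mod_file="${base_dir}/${arch}/modules/all/${name}/${version}.lua"

    status=present
    [ -d "${sw_dir}" ] || status=missing
    [ -f "${mod_file}" ] || status=missing

    printf '%s %s/%s: %s\n' "${arch}" "${name}" "${version}" "${status}"
    [ "${status}" = present ]
)

# Example call, assuming BASE_DIR is set as in the command above:
#   check_ingested "${BASE_DIR}" x86_64/amd/zen2 PyTorch 2.1.2-foss-2023a-CUDA-12.1.1
```

Because the function returns a non-zero status when anything is missing, it can also be used in a loop over `ARCHS` to stop at the first architecture that has not been ingested yet.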

@trz42 trz42 added the ingested label May 30, 2024
@poksumdo poksumdo left a comment

Not an easy one.

We'll have to redo it once it is clear how to build such modules in EESSI and when the bot is ready for it.

In the meantime we can test it on Saga, Betzy and eX3.

@poksumdo poksumdo merged commit edf8e0a into NorESSI:nessi.no-2023.06 May 30, 2024
25 of 26 checks passed
3 participants