Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use tempfile to prevent install io race crashes #929

Merged

Conversation

konstin
Copy link
Member

@konstin konstin commented Jan 15, 2024

On ubuntu and python 3.10,

cargo run -q -- pip-install --find-links https://storage.googleapis.com/jax-releases/jax_cuda_releases.html "jax[cuda12_pip]==0.4.23"

non-deterministically but for me consistently fails to install with messages such as

error: Failed to install: nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (nvidia-nccl-cu12==2.19.3)
  Caused by: failed to remove file `/home/konsti/projects/puffin/.venv/lib/python3.10/site-packages/nvidia/__init__.py`
  Caused by: No such file or directory (os error 2)
error: Failed to install: nvidia_cublas_cu12-12.3.4.1-py3-none-manylinux1_x86_64.whl (nvidia-cublas-cu12==12.3.4.1)
  Caused by: Replacing an existing file or directory failed
error: Failed to install: nvidia_cuda_nvcc_cu12-12.3.107-py3-none-manylinux1_x86_64.whl (nvidia-cuda-nvcc-cu12==12.3.107)
  Caused by: failed to hardlink file from /home/konsti/.cache/puffin/wheels-v0/pypi/nvidia-cuda-nvcc-cu12/nvidia_cuda_nvcc_cu12-12.3.107-py3-none-manylinux1_x86_64/nvidia/__init__.py to /home/konsti/projects/puffin/.venv/lib/python3.10/site-packages/nvidia/__init__.py
  Caused by: File exists (os error 17)

We install a lot of nvidia package, that all contain nvidia/__init__.py, since they all install themselves into the nvidia module:

nvidia-cublas-cu12==12.3.4.1
nvidia-cuda-cupti-cu12==12.3.101
nvidia-cuda-nvcc-cu12==12.3.107
nvidia-cuda-nvrtc-cu12==12.3.107
nvidia-cuda-runtime-cu12==12.3.101
nvidia-cudnn-cu12==8.9.7.29
nvidia-cufft-cu12==11.0.12.1
nvidia-cusolver-cu12==11.5.4.101
nvidia-cusparse-cu12==12.2.0.103
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.3.101
$  tree -L 1 .venv/lib/python3.10/site-packages/nvidia
.venv/lib/python3.10/site-packages/nvidia
├── cublas
├── cuda_cupti
├── cuda_nvcc
├── cuda_nvrtc
├── cuda_runtime
├── cudnn
├── cufft
├── cusolver
├── cusparse
├── __init__.py
├── nccl
└── nvjitlink

When installing we get a race condition, each package installation is its own thread:

  • Installer Thread 1 creates nvidia/__init__.py
  • Installer Thread 2 sees an existing nvidia/__init__.py
  • Installer Thread 3 sees an existing nvidia/__init__.py
  • Installer Thread 2 removes nvidia/__init__.py
  • Installer Thread 3 tries to remove nvidia/__init__.py, it doesn't exist anymore -> failure.

We switch to a new strategy: When the target files exists, we don't remove it, but instead hardlink the source file to a tempfile first, then renaming the tempfile to the target file. Renaming is considered an atomic operation.

I've put the logging on debug level because they cases indicate a conflict between two packages while being rare.

Closes #925

@konstin konstin added the bug Something isn't working label Jan 15, 2024
@konstin konstin marked this pull request as draft January 15, 2024 10:50
fs::remove_file(&out_path)?;
// Prevent races between multiple threads trying to replace a file
let lock = fallback_lock.lock().expect(broken_threadpool);
fs::remove_file(&out_path).map_err(Error::Fallback)?;
if fs::hard_link(path, &out_path).is_err() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we instead: hardlink to a temporary file, then rename the temporary file to the target? That should be safe and doesn't require any deletes.

@konstin konstin force-pushed the konsti/add-locks-on-installation-fixup-to-avoid-rages branch from 01b3d33 to ba6fefb Compare January 16, 2024 10:36
@konstin konstin changed the title Add locks to prevent install fallback race condition Use tempfile to prevent install io race crashes Jan 16, 2024
On ubuntu and python 3.10,

```
cargo run -q -- pip-install --find-links https://storage.googleapis.com/jax-releases/jax_cuda_releases.html "jax[cuda12_pip]==0.4.23"
```

non-deterministically but for me consistently fails to install with messages such as

```
error: Failed to install: nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (nvidia-nccl-cu12==2.19.3)
  Caused by: failed to remove file `/home/konsti/projects/puffin/.venv/lib/python3.10/site-packages/nvidia/__init__.py`
  Caused by: No such file or directory (os error 2)
```

```
error: Failed to install: nvidia_cublas_cu12-12.3.4.1-py3-none-manylinux1_x86_64.whl (nvidia-cublas-cu12==12.3.4.1)
  Caused by: Replacing an existing file or directory failed
```

```
error: Failed to install: nvidia_cuda_nvcc_cu12-12.3.107-py3-none-manylinux1_x86_64.whl (nvidia-cuda-nvcc-cu12==12.3.107)
  Caused by: failed to hardlink file from /home/konsti/.cache/puffin/wheels-v0/pypi/nvidia-cuda-nvcc-cu12/nvidia_cuda_nvcc_cu12-12.3.107-py3-none-manylinux1_x86_64/nvidia/__init__.py to /home/konsti/projects/puffin/.venv/lib/python3.10/site-packages/nvidia/__init__.py
  Caused by: File exists (os error 17)
```

We install a lot of nvidia package, that all contain `nvidia/__init__.py`, since they all install themselves into the `nvidia` module:

```
nvidia-cublas-cu12==12.3.4.1
nvidia-cuda-cupti-cu12==12.3.101
nvidia-cuda-nvcc-cu12==12.3.107
nvidia-cuda-nvrtc-cu12==12.3.107
nvidia-cuda-runtime-cu12==12.3.101
nvidia-cudnn-cu12==8.9.7.29
nvidia-cufft-cu12==11.0.12.1
nvidia-cusolver-cu12==11.5.4.101
nvidia-cusparse-cu12==12.2.0.103
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.3.101
```

```
$  tree -L 1 .venv/lib/python3.10/site-packages/nvidia
.venv/lib/python3.10/site-packages/nvidia
├── cublas
├── cuda_cupti
├── cuda_nvcc
├── cuda_nvrtc
├── cuda_runtime
├── cudnn
├── cufft
├── cusolver
├── cusparse
├── __init__.py
├── nccl
└── nvjitlink
```

When installing we get a race condition, each package installation is its own thread:
* Installer Thread 1 creates `nvidia/__init__.py`
* Installer Thread 2 sees an existing  `nvidia/__init__.py`
* Installer Thread 3 sees an existing  `nvidia/__init__.py`
* Installer Thread 2 removes `nvidia/__init__.py`
* Installer Thread 3 tries to remove `nvidia/__init__.py`, it doesn't exist anymore -> failure.

We switch to a new strategy: When the target files exists, we don't remove it, but instead hardlink the source file to a tempfile first, then renaming the tempfile to the target file. Renaming is considered an atomic operation.

I've put the logging on debug level because they cases indicate a conflict between two packages while being rare.

Closes #925
@konstin konstin force-pushed the konsti/add-locks-on-installation-fixup-to-avoid-rages branch from ba6fefb to 552ea8e Compare January 16, 2024 10:39
@konstin
Copy link
Member Author

konstin commented Jan 16, 2024

I've changed the strategy to tempfile-hardlink-and-rename and updated the description.

@charliermarsh charliermarsh marked this pull request as ready for review January 16, 2024 21:02
@charliermarsh charliermarsh enabled auto-merge (squash) January 16, 2024 21:06
@charliermarsh charliermarsh merged commit 5051b2c into main Jan 16, 2024
3 checks passed
@charliermarsh charliermarsh deleted the konsti/add-locks-on-installation-fixup-to-avoid-rages branch January 16, 2024 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

jax fails to install due to race condition
2 participants