-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use tempfile to prevent install io race crashes #929
Merged
charliermarsh
merged 2 commits into
main
from
konsti/add-locks-on-installation-fixup-to-avoid-rages
Jan 16, 2024
Merged
Use tempfile to prevent install io race crashes #929
charliermarsh
merged 2 commits into
main
from
konsti/add-locks-on-installation-fixup-to-avoid-rages
Jan 16, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
fs::remove_file(&out_path)?; | ||
// Prevent races between multiple threads trying to replace a file | ||
let lock = fallback_lock.lock().expect(broken_threadpool); | ||
fs::remove_file(&out_path).map_err(Error::Fallback)?; | ||
if fs::hard_link(path, &out_path).is_err() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we instead: hardlink to a temporary file, then rename the temporary file to the target? That should be safe and doesn't require any deletes.
konstin
force-pushed
the
konsti/add-locks-on-installation-fixup-to-avoid-rages
branch
from
January 16, 2024 10:36
01b3d33
to
ba6fefb
Compare
konstin
changed the title
Add locks to prevent install fallback race condition
Use tempfile to prevent install io race crashes
Jan 16, 2024
On ubuntu and python 3.10, ``` cargo run -q -- pip-install --find-links https://storage.googleapis.com/jax-releases/jax_cuda_releases.html "jax[cuda12_pip]==0.4.23" ``` non-deterministically but for me consistently fails to install with messages such as ``` error: Failed to install: nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (nvidia-nccl-cu12==2.19.3) Caused by: failed to remove file `/home/konsti/projects/puffin/.venv/lib/python3.10/site-packages/nvidia/__init__.py` Caused by: No such file or directory (os error 2) ``` ``` error: Failed to install: nvidia_cublas_cu12-12.3.4.1-py3-none-manylinux1_x86_64.whl (nvidia-cublas-cu12==12.3.4.1) Caused by: Replacing an existing file or directory failed ``` ``` error: Failed to install: nvidia_cuda_nvcc_cu12-12.3.107-py3-none-manylinux1_x86_64.whl (nvidia-cuda-nvcc-cu12==12.3.107) Caused by: failed to hardlink file from /home/konsti/.cache/puffin/wheels-v0/pypi/nvidia-cuda-nvcc-cu12/nvidia_cuda_nvcc_cu12-12.3.107-py3-none-manylinux1_x86_64/nvidia/__init__.py to /home/konsti/projects/puffin/.venv/lib/python3.10/site-packages/nvidia/__init__.py Caused by: File exists (os error 17) ``` We install a lot of nvidia package, that all contain `nvidia/__init__.py`, since they all install themselves into the `nvidia` module: ``` nvidia-cublas-cu12==12.3.4.1 nvidia-cuda-cupti-cu12==12.3.101 nvidia-cuda-nvcc-cu12==12.3.107 nvidia-cuda-nvrtc-cu12==12.3.107 nvidia-cuda-runtime-cu12==12.3.101 nvidia-cudnn-cu12==8.9.7.29 nvidia-cufft-cu12==11.0.12.1 nvidia-cusolver-cu12==11.5.4.101 nvidia-cusparse-cu12==12.2.0.103 nvidia-nccl-cu12==2.19.3 nvidia-nvjitlink-cu12==12.3.101 ``` ``` $ tree -L 1 .venv/lib/python3.10/site-packages/nvidia .venv/lib/python3.10/site-packages/nvidia ├── cublas ├── cuda_cupti ├── cuda_nvcc ├── cuda_nvrtc ├── cuda_runtime ├── cudnn ├── cufft ├── cusolver ├── cusparse ├── __init__.py ├── nccl └── nvjitlink ``` When installing we get a race condition, each package installation is its own thread: * Installer Thread 1 creates `nvidia/__init__.py` * Installer Thread 2 sees an existing `nvidia/__init__.py` * Installer Thread 3 sees an existing `nvidia/__init__.py` * Installer Thread 2 removes `nvidia/__init__.py` * Installer Thread 3 tries to remove `nvidia/__init__.py`, it doesn't exist anymore -> failure. We switch to a new strategy: When the target files exists, we don't remove it, but instead hardlink the source file to a tempfile first, then renaming the tempfile to the target file. Renaming is considered an atomic operation. I've put the logging on debug level because they cases indicate a conflict between two packages while being rare. Closes #925
konstin
force-pushed
the
konsti/add-locks-on-installation-fixup-to-avoid-rages
branch
from
January 16, 2024 10:39
ba6fefb
to
552ea8e
Compare
I've changed the strategy to tempfile-hardlink-and-rename and updated the description. |
charliermarsh
approved these changes
Jan 16, 2024
charliermarsh
deleted the
konsti/add-locks-on-installation-fixup-to-avoid-rages
branch
January 16, 2024 21:07
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
On ubuntu and python 3.10,
non-deterministically but for me consistently fails to install with messages such as
We install a lot of nvidia package, that all contain
nvidia/__init__.py
, since they all install themselves into thenvidia
module:When installing we get a race condition, each package installation is its own thread:
nvidia/__init__.py
nvidia/__init__.py
nvidia/__init__.py
nvidia/__init__.py
nvidia/__init__.py
, it doesn't exist anymore -> failure.We switch to a new strategy: When the target files exists, we don't remove it, but instead hardlink the source file to a tempfile first, then renaming the tempfile to the target file. Renaming is considered an atomic operation.
I've put the logging on debug level because they cases indicate a conflict between two packages while being rare.
Closes #925