Skip to content

Commit

Permalink
Use tempfile to prevent install io race crashes (#929)
Browse files Browse the repository at this point in the history
On ubuntu and python 3.10,

```
cargo run -q -- pip-install --find-links https://storage.googleapis.com/jax-releases/jax_cuda_releases.html "jax[cuda12_pip]==0.4.23"
```

non-deterministically but for me consistently fails to install with
messages such as

```
error: Failed to install: nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (nvidia-nccl-cu12==2.19.3)
  Caused by: failed to remove file `/home/konsti/projects/puffin/.venv/lib/python3.10/site-packages/nvidia/__init__.py`
  Caused by: No such file or directory (os error 2)
```

```
error: Failed to install: nvidia_cublas_cu12-12.3.4.1-py3-none-manylinux1_x86_64.whl (nvidia-cublas-cu12==12.3.4.1)
  Caused by: Replacing an existing file or directory failed
```

```
error: Failed to install: nvidia_cuda_nvcc_cu12-12.3.107-py3-none-manylinux1_x86_64.whl (nvidia-cuda-nvcc-cu12==12.3.107)
  Caused by: failed to hardlink file from /home/konsti/.cache/puffin/wheels-v0/pypi/nvidia-cuda-nvcc-cu12/nvidia_cuda_nvcc_cu12-12.3.107-py3-none-manylinux1_x86_64/nvidia/__init__.py to /home/konsti/projects/puffin/.venv/lib/python3.10/site-packages/nvidia/__init__.py
  Caused by: File exists (os error 17)
```

We install a lot of nvidia package, that all contain
`nvidia/__init__.py`, since they all install themselves into the
`nvidia` module:

```
nvidia-cublas-cu12==12.3.4.1
nvidia-cuda-cupti-cu12==12.3.101
nvidia-cuda-nvcc-cu12==12.3.107
nvidia-cuda-nvrtc-cu12==12.3.107
nvidia-cuda-runtime-cu12==12.3.101
nvidia-cudnn-cu12==8.9.7.29
nvidia-cufft-cu12==11.0.12.1
nvidia-cusolver-cu12==11.5.4.101
nvidia-cusparse-cu12==12.2.0.103
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.3.101
```

```
$  tree -L 1 .venv/lib/python3.10/site-packages/nvidia
.venv/lib/python3.10/site-packages/nvidia
├── cublas
├── cuda_cupti
├── cuda_nvcc
├── cuda_nvrtc
├── cuda_runtime
├── cudnn
├── cufft
├── cusolver
├── cusparse
├── __init__.py
├── nccl
└── nvjitlink
```

When installing we get a race condition, each package installation is
its own thread:
* Installer Thread 1 creates `nvidia/__init__.py`
* Installer Thread 2 sees an existing  `nvidia/__init__.py`
* Installer Thread 3 sees an existing  `nvidia/__init__.py`
* Installer Thread 2 removes `nvidia/__init__.py`
* Installer Thread 3 tries to remove `nvidia/__init__.py`, it doesn't
exist anymore -> failure.

We switch to a new strategy: When the target files exists, we don't
remove it, but instead hardlink the source file to a tempfile first,
then renaming the tempfile to the target file. Renaming is considered an
atomic operation.

I've put the logging on debug level because they cases indicate a
conflict between two packages while being rare.

Closes #925

---------

Co-authored-by: Charlie Marsh <[email protected]>
  • Loading branch information
konstin and charliermarsh authored Jan 16, 2024
1 parent 73fccdd commit 5051b2c
Showing 1 changed file with 19 additions and 4 deletions.
23 changes: 19 additions & 4 deletions crates/install-wheel-rs/src/linker.rs
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ use std::path::Path;
use configparser::ini::Ini;
use fs_err as fs;
use fs_err::File;
use tempfile::tempdir_in;
use tracing::{debug, instrument};

use pypi_types::DirectUrl;
Expand Down Expand Up @@ -398,11 +399,18 @@ fn hardlink_wheel_files(
if let Err(err) = fs::hard_link(path, &out_path) {
// If the file already exists, remove it and try again.
if err.kind() == std::io::ErrorKind::AlreadyExists {
fs::remove_file(&out_path)?;
if fs::hard_link(path, &out_path).is_err() {
debug!(
"File already exists (initial attempt), overwriting: {}",
out_path.display()
);
// Removing and recreating would lead to race conditions.
let tempdir = tempdir_in(&site_packages)?;
let tempfile = tempdir.path().join(entry.file_name());
if fs::hard_link(path, &tempfile).is_err() {
fs::copy(path, &out_path)?;
attempt = Attempt::UseCopyFallback;
}
fs_err::rename(&tempfile, &out_path)?;
} else {
fs::copy(path, &out_path)?;
attempt = Attempt::UseCopyFallback;
Expand All @@ -413,8 +421,15 @@ fn hardlink_wheel_files(
if let Err(err) = fs::hard_link(path, &out_path) {
// If the file already exists, remove it and try again.
if err.kind() == std::io::ErrorKind::AlreadyExists {
fs::remove_file(&out_path)?;
fs::hard_link(path, &out_path)?;
debug!(
"File already exists (subsequent attempt), overwriting: {}",
out_path.display()
);
// Removing and recreating would lead to race conditions.
let tempdir = tempdir_in(&site_packages)?;
let tempfile = tempdir.path().join(entry.file_name());
fs::hard_link(path, &tempfile)?;
fs_err::rename(&tempfile, &out_path)?;
} else {
return Err(err.into());
}
Expand Down

0 comments on commit 5051b2c

Please sign in to comment.