Process h5ad in constant memory #963
Enforcing canonical format is the biggest consumer of memory because it requires reading in the whole X and raw.X matrices. This is due to complexities in how sparse matrices are stored and how difficult and inefficient it is to update those formats in chunks. My recommendation is to move the enforcement of canonical format to the Seurat conversion step in our workflow. The reason we are enforcing canonical format in the first place is an issue with Seurat conversion. Since the Seurat conversion step will not be seeing any improvements in memory efficiency, it will have enough memory to load the dataset into memory and enforce canonical format there.
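For context, "enforcing canonical format" boils down to something like the following (a minimal sketch, not the portal's actual code; the file name and the raw.X handling are illustrative, and it assumes X and raw.X are scipy CSR matrices). The whole matrix has to be materialized before scipy can merge duplicate entries and sort indices:
import anndata as ad

adata = ad.read_h5ad("dataset.h5ad")  # loads X (and raw.X) fully into memory
matrices = [adata.X] + ([adata.raw.X] if adata.raw is not None else [])
for X in matrices:
    if not X.has_canonical_format:
        X.sum_duplicates()  # scipy: merges duplicate entries and sorts indices in place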
Another recommendation is to consolidate the sparse matrix checkers in the portal and the CLI into one. The CLI version is memory efficient and processes the matrix in chunks.
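For reference, a chunk-based check along those lines might look like this (a minimal sketch, assuming the matrix is CSR-encoded on disk; the function name, group name, and chunk size are illustrative, not the CLI's actual implementation):
import h5py
import numpy as np

def is_canonical_csr(path: str, group: str = "X", step: int = 10_000) -> bool:
    """Check for sorted, duplicate-free column indices row by row without loading X."""
    with h5py.File(path, mode="r") as f:
        indptr = f[group]["indptr"][...]   # small: one entry per row plus one
        indices = f[group]["indices"]      # stays on disk; sliced in row chunks
        n_rows = len(indptr) - 1
        for start in range(0, n_rows, step):
            stop = min(start + step, n_rows)
            chunk = indices[indptr[start]:indptr[stop]]
            offsets = indptr[start:stop + 1] - indptr[start]
            for r in range(stop - start):
                row = chunk[offsets[r]:offsets[r + 1]]
                if np.any(np.diff(row) <= 0):  # unsorted or duplicate column index
                    return False
    return True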
Estimate: 2-3 weeks x 1 engineer, 1.5-2 weeks x 2 engineers, 1-1.5 weeks x 3 engineers (the issues are parallelizable). @Bento007 for comment on these estimates.
Btw, this should be possible without reading the whole matrix, since canonicalization should only remove entries. That part shouldn't matter if the canonicalization isn't happening in place, though. I can advise on this if needed.
Gotcha, thanks Isaac. I'm going to quote this in the follow-up issue for this part of the work, so that the engineer implementing it can contact you for more details when they pick it up.
@nayib-jose-gloria, here's a small demo of how this could be done out of core using some anndata functionality:
import h5py
from scipy import sparse
import numpy as np
from anndata.experimental import CSRDataset, sparse_dataset, write_elem, read_elem
def canonicalize_batched(source: CSRDataset, sink: CSRDataset, *, step: int = 10_000):
    # Read `step` rows at a time, canonicalize each slice in memory, and append
    # it to the on-disk destination, so peak memory is bounded by the chunk size.
    start = 0
    n_rows = source.shape[0]
    while n_rows > start:
        stop = min(start + step, source.shape[0])
        X_slice = source[start:stop]
        X_slice.sum_duplicates()
        sink.append(X_slice)
        start = stop
rng = np.random.default_rng()
X = sparse.random(10000, 10000, random_state=rng, format="csr", density=0.01)
# Setup written sparse matrix
f = h5py.File("tmp.h5", mode="w")
write_elem(f, "X", X)
X_backed = sparse_dataset(f["X"])
# Initialize destination sparse dataset
write_elem(f, "X_canonical", sparse.csr_matrix((0, X_backed.shape[1]), dtype=X_backed.dtype))
X_updated = sparse_dataset(f["X_canonical"])
# Canonicalize in batches:
canonicalize_batched(X_backed, X_updated, step=1000)
# Check that the results are the same:
orig = X_backed[...]
updated = X_updated[...]
# These are the same since there aren't any duplicates in the original
assert (orig != updated).nnz == 0
Motivation
Definition of Done
AnnData.to_memory
in the code.
Tasks
Notes
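The Definition of Done above mentions AnnData.to_memory; for context, a minimal sketch of the backed-versus-in-memory distinction it refers to (the file name is illustrative, and this is not the validator's actual code):
import anndata as ad

# Backed mode keeps X on disk; obs/var metadata is loaded, the matrix is not.
adata = ad.read_h5ad("dataset.h5ad", backed="r")

# .to_memory() materializes the full object, which is exactly the cost that
# constant-memory processing of large h5ad files needs to avoid.
adata_mem = adata.to_memory()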