Process h5ad in constant memory #963

Open
Bento007 opened this issue Jul 8, 2024 · 7 comments
Assignees
Labels
dp (Data Platform Team work), tech (Tech issues that do not require product prioritization. Tech debt, tooling, ops, etc.)

Comments

@Bento007 (Contributor) commented Jul 8, 2024

Motivation

  • Make memory requirements predictable
  • Reduce dataset processing failures due to OOM

Definition of Done

  • Validate and write datasets in constant memory
  • Eliminate the use of AnnData.to_memory in the code.

Tasks

  • This may not be possible for write scenarios with the current anndata library.
  • Accomplish this for the X and raw.X matrices.

Notes

  • We may need to suggest changes to the anndata library to accomplish this.
  • Consider looking at dask arrays (a backed-mode sketch follows the docs quote below).
  • From the anndata docs:

IO operations

Read/Write operations on h5ad and Zarr are supported. One should note that the lazy objects are materialized when this is called. For now, the anndata loaded from file won’t be loaded with dask arrays in it.
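
For reference, a minimal sketch of what constant-memory, chunked access to X could look like using anndata's backed mode (without dask). The file path, chunk size, and validation check are placeholders, and this assumes X is stored as CSR (or a dense array); the real validator's checks would differ:

import anndata as ad
import numpy as np
from scipy import sparse

def validate_X_in_chunks(path: str, step: int = 10_000) -> None:
    # Open the h5ad in backed mode so X stays on disk.
    adata = ad.read_h5ad(path, backed="r")
    n_obs = adata.shape[0]

    for start in range(0, n_obs, step):
        stop = min(start + step, n_obs)
        # Slicing backed X materializes only this chunk in memory.
        chunk = adata.X[start:stop]
        values = chunk.data if sparse.issparse(chunk) else np.asarray(chunk)
        # Placeholder check: reject NaNs in this chunk.
        if np.isnan(values).any():
            raise ValueError(f"NaN values found in rows {start}:{stop}")

    adata.file.close()

# Hypothetical usage:
validate_X_in_chunks("dataset.h5ad", step=5_000)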

@Bento007 added the tech label Jul 8, 2024
@nayib-jose-gloria self-assigned this Jul 12, 2024
@Bento007 (Contributor, Author) commented Jul 16, 2024

Enforcing canonical format is the biggest consumer of memory because it requires reading in the whole X and raw.X matrices. This is due to complexities in how sparse matrices are stored and how difficult and inefficient it is to update those formats in chunks.

My recommendation is to move the enforcement of canonical format to the Seurat conversion step in our workflow. The reason we are enforcing canonical format in the first place is an issue with Seurat conversion. Since the Seurat conversion step will not be seeing any improvements in memory efficiency, it will have enough memory to load the dataset into memory and enforce canonical format.
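
For context, the in-memory path described above amounts to roughly the following sketch (file paths are placeholders; this assumes X is a scipy CSR/CSC matrix):

import anndata as ad
from scipy import sparse

# Reading without backed mode pulls X (and raw.X) fully into memory,
# so peak memory scales with matrix size.
adata = ad.read_h5ad("dataset.h5ad")  # placeholder path

if sparse.issparse(adata.X):
    # Canonical form for scipy sparse: duplicate entries summed, indices sorted.
    adata.X.sum_duplicates()
    adata.X.sort_indices()

adata.write_h5ad("dataset_canonical.h5ad")  # placeholder path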

@Bento007 (Contributor, Author) commented:

Another recommendation is to consolidate the sparse checkers in the portal and the CLI into one. The CLI version is memory efficient and processes the matrix in chunks.
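
For reference, a chunked sparsity check along those lines might look like the sketch below. It assumes a dense X stored as an HDF5 dataset; the function name and threshold are illustrative, not the actual CLI implementation:

import h5py
import numpy as np

def fraction_nonzero_chunked(path: str, dataset: str = "X", step: int = 10_000) -> float:
    # Estimate how dense a matrix is without ever loading it whole.
    with h5py.File(path, mode="r") as f:
        dset = f[dataset]
        n_rows, n_cols = dset.shape
        nnz = 0
        for start in range(0, n_rows, step):
            stop = min(start + step, n_rows)
            chunk = dset[start:stop]  # only this slice is in memory
            nnz += int(np.count_nonzero(chunk))
    return nnz / (n_rows * n_cols)

# Hypothetical usage: flag matrices that should be stored sparsely.
# if fraction_nonzero_chunked("dataset.h5ad", "X") < 0.5: ...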

@nayib-jose-gloria (Contributor) commented:

Estimate: 2-3 weeks x 1 engineer, 1.5-2 weeks x 2 engineers, 1-1.5 weeks x 3 engineers (issues are parallelizable)

@Bento007 for comment on these estimates

@ivirshup commented:

Enforcing canonical format is the biggest consumer of memory because it requires reading in the whole X and raw.X matrix.

Btw, this should be possible without reading the whole matrix, since canonicalization should only remove entries. That part shouldn't matter if the canonicalization isn't happening in place though. I can advise on this if needed.

@nayib-jose-gloria (Contributor) commented:

Enforcing canonical format is the biggest consumer of memory because it requires reading in the whole X and raw.X matrix.

Btw, this should be possible without reading the whole matrix, since canonicalization should only remove entries. That part shouldn't matter if the canonicalization isn't happening in place though. I can advise on this if needed.

Gotcha, thanks Isaac. I'm going to quote this in the follow-up issue for this part of the work, so that the engineer implementing it can contact you for more details when they pick it up.

@ivirshup commented Aug 5, 2024

@nayib-jose-gloria, here's a small demo of how this could be done out of core using some anndata functionality:

import h5py
from scipy import sparse
import numpy as np

from anndata.experimental import CSRDataset, sparse_dataset, write_elem

def canonicalize_batched(source: CSRDataset, sink: CSRDataset, *, step: int = 10_000):
    """Canonicalize `source` in chunks of `step` rows, appending the result to `sink`."""
    start = 0
    n_rows = source.shape[0]

    while n_rows > start:
        stop = min(start + step, n_rows)

        # Only `step` rows are materialized in memory at a time.
        X_slice = source[start:stop]
        X_slice.sum_duplicates()  # sum duplicate entries (also sorts indices)

        sink.append(X_slice)

        start = stop

rng = np.random.default_rng()
X = sparse.random(10000, 10000, random_state=rng, format="csr", density=0.01)

# Setup written sparse matrix
f = h5py.File("tmp.h5", mode="w")
write_elem(f, "X", X)
X_backed = sparse_dataset(f["X"])

# Initialize destination sparse dataset
write_elem(f, "X_canonical", sparse.csr_matrix((0, X_backed.shape[1]), dtype=X_backed.dtype))
X_updated = sparse_dataset(f["X_canonical"])

# Canonicalize in batches:
canonicalize_batched(X_backed, X_updated, step=1000)

# Check that the results are the same:
orig = X_backed[...]
updated = X_updated[...]

# These are the same since there aren't any duplicates in the original
assert (orig != updated).nnz == 0
