Process h5ad in constant memory #963

Open
Bento007 opened this issue Jul 8, 2024 · 7 comments
Assignees
Labels
dp (Data Platform Team work), tech (Tech issues that do not require product prioritization. Tech debt, tooling, ops, etc.)

Comments

@Bento007 (Contributor) commented Jul 8, 2024

Motivation

  • Make memory requirements predictable
  • Reduce dataset processing failures due to OOM

Definition of Done

  • Validate and write datasets in constant memory
  • Eliminate the use of AnnData.to_memory in the code.

Tasks

  • This may not be possible for write scenarios with the current anndata library.
  • Accomplish this for the X and raw.X matrices.

Notes

  • We may need to suggest changes to the anndata library to accomplish this.
  • Consider looking at dask arrays (a backed-mode sketch follows the docs quote below).
  • From the anndata docs:

IO operations

Read/Write operations on h5ad and Zarr are supported. One should note that the lazy objects are materialized when this is called. For now, the anndata loaded from file won’t be loaded with dask arrays in it.
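
For reference, a minimal sketch of what constant-memory, chunked access to X could look like using anndata's backed mode (without dask). The file path, chunk size, and validation check are placeholders, and this assumes X is stored as CSR (or a dense array); the real validator's checks would differ:

import anndata as ad
import numpy as np
from scipy import sparse

def validate_X_in_chunks(path: str, step: int = 10_000) -> None:
    # Open the h5ad in backed mode so X stays on disk.
    adata = ad.read_h5ad(path, backed="r")
    n_obs = adata.shape[0]

    for start in range(0, n_obs, step):
        stop = min(start + step, n_obs)
        # Slicing backed X materializes only this chunk in memory.
        chunk = adata.X[start:stop]
        values = chunk.data if sparse.issparse(chunk) else np.asarray(chunk)
        # Placeholder check: reject NaNs in this chunk.
        if np.isnan(values).any():
            raise ValueError(f"NaN values found in rows {start}:{stop}")

    adata.file.close()

# Hypothetical usage:
validate_X_in_chunks("dataset.h5ad", step=5_000)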

@Bento007 added the tech label Jul 8, 2024
@nayib-jose-gloria self-assigned this Jul 12, 2024
@Bento007 (Contributor, Author) commented Jul 16, 2024

Enforcing canonical format is the biggest consumer of memory because it requires reading in the whole X and raw.X matrices. This is due to complexities in how sparse matrices are stored and how difficult and inefficient it is to update those formats in chunks.

My recommendation is to move the enforcement of canonical format to the Seurat conversion step in our workflow. The reason we are enforcing canonical format in the first place is an issue with Seurat conversion. Since the Seurat conversion step will not be seeing any improvements in memory efficiency, it will have enough memory to load the dataset into memory and enforce canonical format.
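
For context, the in-memory path described above amounts to roughly the following sketch (file paths are placeholders; this assumes X is a scipy CSR/CSC matrix):

import anndata as ad
from scipy import sparse

# Reading without backed mode pulls X (and raw.X) fully into memory,
# so peak memory scales with matrix size.
adata = ad.read_h5ad("dataset.h5ad")  # placeholder path

if sparse.issparse(adata.X):
    # Canonical form for scipy sparse: duplicate entries summed, indices sorted.
    adata.X.sum_duplicates()
    adata.X.sort_indices()

adata.write_h5ad("dataset_canonical.h5ad")  # placeholder path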

@Bento007 (Contributor, Author) commented:

Another recommendation is to consolidate the sparse checkers in the portal and the CLI into one. The CLI version is memory efficient and processes the matrix in chunks.
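
For reference, a chunked sparsity check along those lines might look like the sketch below. It assumes a dense X stored as an HDF5 dataset; the function name and threshold are illustrative, not the actual CLI implementation:

import h5py
import numpy as np

def fraction_nonzero_chunked(path: str, dataset: str = "X", step: int = 10_000) -> float:
    # Estimate how dense a matrix is without ever loading it whole.
    with h5py.File(path, mode="r") as f:
        dset = f[dataset]
        n_rows, n_cols = dset.shape
        nnz = 0
        for start in range(0, n_rows, step):
            stop = min(start + step, n_rows)
            chunk = dset[start:stop]  # only this slice is in memory
            nnz += int(np.count_nonzero(chunk))
    return nnz / (n_rows * n_cols)

# Hypothetical usage: flag matrices that should be stored sparsely.
# if fraction_nonzero_chunked("dataset.h5ad", "X") < 0.5: ...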

@nayib-jose-gloria (Contributor) commented:

Estimate: 2-3 weeks x 1 engineer, 1.5-2 weeks x 2 engineers, 1-1.5 weeks x 3 engineers (issues are parallelizable)

@Bento007 for comment on these estimates

@ivirshup commented:

Enforcing canonical format is the biggest consumer of memory because it requires reading in the whole X and raw.X matrix.

Btw, this should be possible without reading the whole matrix, since canonicalization should only remove entries. That part shouldn't matter if the canonicalization isn't happening in place though. I can advise on this if needed.

@nayib-jose-gloria (Contributor) commented:

Enforcing canonical format is the biggest consumer of memory because it requires reading in the whole X and raw.X matrix.

Btw, this should be possible without reading the whole matrix, since canonicalization should only remove entries. That part shouldn't matter if the canonicalization isn't happening in place though. I can advise on this if needed.

Gotcha, thanks Isaac. I'm going to quote this in the follow-up issue for this part of the work, so that the engineer implementing it can contact you for more details when they pick it up.

@ivirshup commented Aug 5, 2024

@nayib-jose-gloria, here's a small demo of how this could be done out of core using some anndata functionality:

import h5py
from scipy import sparse
import numpy as np

from anndata.experimental import CSRDataset, sparse_dataset, write_elem

def canonicalize_batched(source: CSRDataset, sink: CSRDataset, *, step: int = 10_000):
    """Canonicalize `source` in chunks of `step` rows, appending the result to `sink`."""
    start = 0
    n_rows = source.shape[0]

    while n_rows > start:
        stop = min(start + step, n_rows)

        # Only `step` rows are materialized in memory at a time.
        X_slice = source[start:stop]
        X_slice.sum_duplicates()  # sum duplicate entries (also sorts indices)

        sink.append(X_slice)

        start = stop

rng = np.random.default_rng()
X = sparse.random(10000, 10000, random_state=rng, format="csr", density=0.01)

# Setup written sparse matrix
f = h5py.File("tmp.h5", mode="w")
write_elem(f, "X", X)
X_backed = sparse_dataset(f["X"])

# Initialize destination sparse dataset
write_elem(f, "X_canonical", sparse.csr_matrix((0, X_backed.shape[1]), dtype=X_backed.dtype))
X_updated = sparse_dataset(f["X_canonical"])

# Canonicalize in batches:
canonicalize_batched(X_backed, X_updated, step=1000)

# Check that the results are the same:
orig = X_backed[...]
updated = X_updated[...]

# These are the same since there aren't any duplicates in the original
assert (orig != updated).nnz == 0
