feat: update to anndata 0.11 and memory efficient reads + writes #1152

nayib-jose-gloria · 2024-12-11T00:10:13Z

Reason for Change

https://czi.atlassian.net/browse/VC-1334

Changes

Modify Isaac/Emanuele's alternate backed h5ad read approach for a more memory-efficient h5ad read. Same concept, but it returns matrix layers as dask arrays for chunked, memory-efficient writes.
Use single-threaded dask cluster for memory-efficient h5ad writes
Update _validate_raw_data_with_in_tissue_0 to traverse the matrix in chunks rather than reading it into memory, since the Visium matrix size is no longer static
Make count_matrix_nonzero a staticmethod so it can be reused in data-portal, which currently implements its own redundant matrix chunking + nonzero counter function.
Update count_matrix_nonzero to count nonzeros among a subset of columns; used by _validate_column_feature_is_filtered. The refactor was done because the previous implementation in _validate_column_feature_is_filtered read the entire matrix into memory rather than leveraging the chunked reads + nonzero counts in the existing count_matrix_nonzero.
Update get_matrix_format and simplify its arg list
- doesn't need adata passed in, as we don't support matrices with 0 cols or rows. Any such cases would be caught by other validation.
- now accounts for passing in dask arrays
Update 'chunk_matrix' to account for dask arrays
Remove TODOs for re-evaluating anndata mixed type checks; anndata does not plan on supporting mixed types, so the checks must stay
Remove the custom chunk_matrix chunking function and use dask chunking (map_blocks)
- NOTE: using map_blocks when there is only 1 chunk is significantly less performant, so I added a check for the number of chunks before invoking it.
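
For readers unfamiliar with the pattern, here is a minimal sketch of a chunked nonzero count with dask's map_blocks, including the single-chunk guard mentioned in the NOTE. It is not the code in this PR: the names count_nonzero_chunked and _count_block are hypothetical, the sketch only handles dense NumPy-backed chunks (the real validator also has to deal with sparse chunks), and the column subset is selected by the caller before counting.

```python
import numpy as np
import dask.array as da


def _count_block(block: np.ndarray) -> np.ndarray:
    # Hypothetical helper: return a 1x1 array so that every input block
    # maps to exactly one cell of the output array.
    return np.array([[np.count_nonzero(block)]], dtype=np.int64)


def count_nonzero_chunked(matrix: da.Array) -> int:
    """Count nonzero entries of a 2-D dask array one chunk at a time."""
    if matrix.npartitions > 1:
        # One partial count per chunk; only the small array of counts is reduced.
        counts = matrix.map_blocks(_count_block, dtype=np.int64, chunks=(1, 1))
        return int(counts.sum().compute())
    # Single chunk: skip map_blocks and count directly.
    return int(np.count_nonzero(matrix.compute()))


dense = np.array(
    [
        [0, 1, 0, 2],
        [3, 0, 0, 0],
        [0, 0, 4, 5],
        [0, 6, 0, 0],
    ]
)
chunked = da.from_array(dense, chunks=(2, 4))  # two row chunks
single = da.from_array(dense, chunks=(4, 4))   # one chunk, takes the direct path

total_chunked = count_nonzero_chunked(chunked)
total_single = count_nonzero_chunked(single)
# Counting over a subset of columns: slice first, then count.
subset_count = count_nonzero_chunked(chunked[:, [1, 3]])
```

Both paths return the same total; the guard only changes how the count is computed, not the result.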

Testing

Tested with a 20 GB H5AD that currently requires more than 20 GB of memory for anndata processing; it processed in comparable time on my machine using only 2.5 GB of memory.
Also tested in an rdev environment pointing to this branch's CLI version. DownloadValidate was able to process the 20 GB H5AD on a 1 vCPU, 8 GB memory machine, where it was previously allocated an 8 vCPU, 64 GB machine, with minimal impact on speed.
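
The low memory footprint depends on chunks being evaluated one at a time. As a rough illustration only, and assuming the "single-threaded" setting from the Changes list corresponds to dask's synchronous scheduler, the sketch below pins that scheduler around a chunked reduction. The small matrix is a hypothetical stand-in for a real layer, and the reduction stands in for the h5ad write.

```python
import dask
import dask.array as da
import numpy as np

# Hypothetical stand-in for a large matrix layer, split into three row chunks.
matrix = da.from_array(np.arange(24, dtype=np.int64).reshape(6, 4), chunks=(2, 4))

# The synchronous scheduler runs one task at a time in the calling thread,
# so chunks are not processed concurrently by a thread pool.
with dask.config.set(scheduler="single-threaded"):
    column_sums = matrix.sum(axis=0).compute()
```

On a 1 vCPU machine a thread pool is unlikely to add much throughput, which may be why the speed impact reported above is small.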

Notes for Reviewer

codecov · 2024-12-12T03:35:48Z

Codecov Report

Attention: Patch coverage is 86.20690% with 12 lines in your changes missing coverage. Please review.

Project coverage is 90.48%. Comparing base (2fc4898) to head (4e20f61).
Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1152      +/-   ##
==========================================
- Coverage   90.70%   90.48%   -0.22%     
==========================================
  Files          18       18              
  Lines        2054     2070      +16     
==========================================
+ Hits         1863     1873      +10     
- Misses        191      197       +6

Components	Coverage Δ
cellxgene_schema_cli	`91.78% <86.20%> (-0.34%)`	⬇️
migration_assistant	`91.26% <ø> (ø)`
schema_bump_dry_run_genes	`79.80% <ø> (ø)`
schema_bump_dry_run_ontologies	`99.53% <ø> (ø)`

Bento007

Looks good for improving memory efficiency on reads. We should do a follow-up to improve compute efficiency using dask.distributed on a local cluster.

nayib-jose-gloria and others added 6 commits December 6, 2024 18:27

feat: update to anndata 0.11 and memory efficient reads

d9f6afc

use dask for memory-efficient writes

0189f7e

remove TODOs for features anndata is not planning to support

90441e7

Merge branch 'main' into nayib/anndata-0-11

743abd7

use dask arrays

9e2ef74

revert local changes

6a18d94

nayib-jose-gloria requested review from Bento007, ejmolinelli and ebezzi December 12, 2024 17:10

Bento007 requested changes Dec 13, 2024

nayib-jose-gloria added 2 commits December 16, 2024 11:10

replace custom chunk_matrix function with dask built-in chunking

eec787b

Merge branch 'main' into nayib/anndata-0-11

ce7ac18

# Conflicts: # cellxgene_schema_cli/cellxgene_schema/validate.py

nayib-jose-gloria force-pushed the nayib/anndata-0-11 branch from 4c7f95d to ce7ac18 Compare December 16, 2024 17:30

nayib-jose-gloria requested a review from Bento007 December 16, 2024 17:31

use single-threaded

26abac4

Bento007 requested changes Dec 17, 2024

cellxgene_schema_cli/cellxgene_schema/validate.py (outdated, resolved)

cellxgene_schema_cli/tests/fixtures/examples_validate.py (outdated, resolved)

nayib-jose-gloria and others added 2 commits December 18, 2024 13:05

filter columns in matrix outside of count_matrix_nonzero + fix test fixtures

1fdeebe

Merge branch 'main' into nayib/anndata-0-11

f33e3c3

Bento007 approved these changes Dec 19, 2024

nayib-jose-gloria added 5 commits December 20, 2024 16:48

chunk dense arrays in same chunk_size as sparse arrays

fa25f40

don't accidentally dask process metadata arrays with X in their name

2df0e0d

don't pick up embeddings with 'layers' in the name

6dddf18

leverage implicit 0 requirement in X matrix

b8c8c29

configurable chunk_size in read_h5ad

4e20f61
