Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: update to anndata 0.11 and memory efficient reads + writes #1152

Draft
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

nayib-jose-gloria
Copy link
Contributor

@nayib-jose-gloria nayib-jose-gloria commented Dec 11, 2024

Reason for Change

Changes

  • modify Isaac/Emanuele's alternate backed h5ad read approach for a more memory-efficient h5ad read. Same concept, but returns matrix layers as dask arrays for chunked, memory-efficient writes.
  • use single-threaded dask cluster for memory efficient h5ad writes
  • update _validate_raw_data_with_in_tissue_0 to traverse matrix in chunks rather than reading into memory, since the visium matrix size is no longer static
  • Make count_matrix_nonzero staticmethod so it may be re-used in data-portal, which currently redundantly implements its own matrix chunking + nonzero counter function.
  • update count_matrix_nonzero to count nonzeros among a subset of columns; used by _validate_column_feature_is_filtered. Refactor was done because the previous implementation in _validate_column_feature_is_filtered reads the entire matrix into memory rather than leveraging the chunked reads + nonzero counts in the already existing count_matrix_nonzero.
  • update get_matrix_format and simpify arg list
    - doesn't need adata passed in, as we don't support matrices with 0 cols or rows. Any such cases would be caught
    by other validation.
    - now accounts for passing in dask arrays
  • update 'chunk_matrix' to account for dask arrays
  • remove TODOs for re-evaluating anndata mixed type checks; anndata does not plan on supporting mixed types so the checks must stay
  • remove chunk_matrix custom chunking function, and use dask chunking (map_blocks)
    - NOTE: using map_blocks when there is only 1 chunk is significantly less performant, so I added a check for number of chunks before invoking.

Testing

  • Tested with 20 GB H5AD that currently requires >20GB memory allocated for anndata processing, was able to process in comparable time on my machine using only 2.5 GB of memory.
  • Also tested in rdev environment pointing to this branch CLI version. Was able to DownloadValidate the 20 GB H5AD in a 1 VCPU 8GB memory machine, when it was previously being allocated an 8 VCPU 64 GB machine, with minimal impact to speed.

Notes for Reviewer

Copy link

codecov bot commented Dec 12, 2024

Codecov Report

Attention: Patch coverage is 86.20690% with 12 lines in your changes missing coverage. Please review.

Project coverage is 90.48%. Comparing base (2fc4898) to head (4e20f61).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1152      +/-   ##
==========================================
- Coverage   90.70%   90.48%   -0.22%     
==========================================
  Files          18       18              
  Lines        2054     2070      +16     
==========================================
+ Hits         1863     1873      +10     
- Misses        191      197       +6     
Components Coverage Δ
cellxgene_schema_cli 91.78% <86.20%> (-0.34%) ⬇️
migration_assistant 91.26% <ø> (ø)
schema_bump_dry_run_genes 79.80% <ø> (ø)
schema_bump_dry_run_ontologies 99.53% <ø> (ø)

Copy link
Contributor

@Bento007 Bento007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for improving memory read efficiency. We should do a follow up to improve the compute efficiency using dask.distributed on a local cluster.

cellxgene_schema_cli/cellxgene_schema/utils.py Outdated Show resolved Hide resolved
cellxgene_schema_cli/cellxgene_schema/utils.py Outdated Show resolved Hide resolved
cellxgene_schema_cli/cellxgene_schema/utils.py Outdated Show resolved Hide resolved
cellxgene_schema_cli/cellxgene_schema/validate.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants