feat: update to anndata 0.11 and memory efficient reads + writes #1152
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1152      +/-   ##
==========================================
- Coverage   90.70%   90.48%   -0.22%
==========================================
  Files          18       18
  Lines        2054     2070      +16
==========================================
+ Hits         1863     1873      +10
- Misses        191      197       +6
Looks good for improving the memory efficiency of reads. We should do a follow-up to improve compute efficiency using dask.distributed on a local cluster.
# Conflicts: # cellxgene_schema_cli/cellxgene_schema/validate.py
4c7f95d to ce7ac18
Reason for Change
Changes
- _validate_raw_data_with_in_tissue_0: traverse the matrix in chunks rather than reading it into memory, since the Visium matrix size is no longer static.
- count_matrix_nonzero: now a staticmethod so it may be re-used in data-portal, which currently redundantly implements its own matrix chunking + nonzero counter function.
- count_matrix_nonzero: can now count nonzeros among a subset of columns; used by _validate_column_feature_is_filtered. The refactor was done because the previous implementation in _validate_column_feature_is_filtered read the entire matrix into memory rather than leveraging the chunked reads + nonzero counts in the already existing count_matrix_nonzero.
- get_matrix_format: update and simplify arg list.
  - doesn't need adata passed in, as we don't support matrices with 0 cols or rows. Any such cases would be caught by other validation.
  - now accounts for passing in dask arrays
  - NOTE: using map_blocks when there is only 1 chunk is significantly less performant, so I added a check for the number of chunks before invoking it.
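For reviewers less familiar with the chunked approach, here is a minimal sketch of the idea behind the nonzero counting described above. It is not the code in this PR: the function name count_nonzero_chunked and the column_mask parameter are hypothetical, it only handles dense blocks (the real validator also deals with sparse chunks), and it assumes the matrix is already a 2-D dask array.

```python
import dask.array as da
import numpy as np


def count_nonzero_chunked(matrix, column_mask=None):
    """Count nonzero entries of a 2-D dask array chunk by chunk.

    Hypothetical helper for illustration only. `column_mask` is an
    optional boolean array selecting the columns to count over.
    """
    if column_mask is not None:
        # Restrict the count to a subset of columns (still lazy).
        matrix = matrix[:, np.asarray(column_mask, dtype=bool)]

    if matrix.npartitions == 1:
        # Single chunk: skip map_blocks and count directly.
        return int(np.count_nonzero(matrix.compute()))

    def _count(block):
        # Reduce each block to a 1x1 array holding its nonzero count.
        return np.array([[np.count_nonzero(block)]])

    per_chunk = matrix.map_blocks(_count, chunks=(1, 1), dtype=np.int64)
    return int(per_chunk.sum().compute())


# Usage: same data, chunked by rows vs. held in a single chunk.
x = np.array([[0, 1, 0, 2], [3, 0, 0, 0], [0, 0, 4, 5], [0, 0, 0, 0]])
multi = da.from_array(x, chunks=(2, 4))
single = da.from_array(x, chunks=(4, 4))
total_multi = count_nonzero_chunked(multi)
total_single = count_nonzero_chunked(single)
subset = count_nonzero_chunked(multi, column_mask=[True, False, False, True])
```

Only one block's worth of data needs to be materialized per task, and the single-chunk branch mirrors the check on the number of chunks mentioned in the NOTE.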
Testing
Notes for Reviewer