feat: {column}_colors validation #625

joyceyan · 2023-09-15T18:42:34Z

https://app.zenhub.com/workspaces/single-cell-5e2a191dad828d52cc78b028/issues/gh/chanzuckerberg/single-cell-curation/513

I generated the new example_valid.h5ad by running the following in a Python notebook. It is essentially a copy of everything in examples_validate.py, followed by a write to an h5ad file.

import pandas as pd
import numpy
import anndata
import os
from scipy import sparse
from cellxgene_schema.utils import get_hash_digest_column
from cellxgene_schema.validate import Validator

# -----------------------------------------------------------------#
# Manually creating minimal anndata objects.
#
# The valid objects mentioned below contain all valid cases covered in the schema, including multiple examples for
# fields that allow multiple valid options.
#
# This process entails:
# 1. Creating individual obs components: one valid dataframe, and one with labels (extra columns that are supposed
#   to be added by validator)
# 2. Creating individual var components: valid, and one with labels
# 3. Creating individual uns valid component
# 4. Creating expression matrices
# 5. Creating valid obsm
# 6. Putting all the components created in the previous steps into a minimal anndata that is used for testing in
#   the unittests

# Valid obs per schema
good_obs = pd.DataFrame(
    [
        [
            "CL:0000066",
            "EFO:0009899",
            "MONDO:0100096",
            "NCBITaxon:9606",
            "PATO:0000383",
            "UBERON:0002048",
            "tissue",
            True,
            "HANCESTRO:0575",
            "HsapDv:0000003",
            "donor_1",
            "nucleus",
        ],
        [
            "CL:0000192",
            "EFO:0009918",
            "PATO:0000461",
            "NCBITaxon:10090",
            "unknown",
            "CL:0000192",
            "cell culture",
            False,
            "na",
            "MmusDv:0000003",
            "donor_2",
            "na",
        ],
    ],
    index=["X", "Y"],
    columns=[
        "cell_type_ontology_term_id",
        "assay_ontology_term_id",
        "disease_ontology_term_id",
        "organism_ontology_term_id",
        "sex_ontology_term_id",
        "tissue_ontology_term_id",
        "tissue_type",
        "is_primary_data",
        "self_reported_ethnicity_ontology_term_id",
        "development_stage_ontology_term_id",
        "donor_id",
        "suspension_type",
    ],
)

# Cast the columns that have {column}_colors entries in uns to categorical dtype.
# Assigning each column directly makes sure the dtype actually changes.
for column in ["donor_id", "suspension_type", "tissue_type"]:
    good_obs[column] = good_obs[column].astype("category")

# Expected obs, this is what the obs above should look like after adding the necessary columns with the validator,
# these columns are defined in the schema
obs_expected = pd.DataFrame(
    [
        [
            "epithelial cell",
            "10x 3' v2",
            "COVID-19",
            "Homo sapiens",
            "female",
            "lung",
            "Yoruban",
            "Carnegie stage 01",
        ],
        [
            "smooth muscle cell",
            "smFISH",
            "normal",
            "Mus musculus",
            "unknown",
            "smooth muscle cell",
            "na",
            "Theiler stage 01",
        ],
    ],
    index=["X", "Y"],
    columns=[
        "cell_type",
        "assay",
        "disease",
        "organism",
        "sex",
        "tissue",
        "self_reported_ethnicity",
        "development_stage",
    ],
)

obs_expected["observation_joinid"] = get_hash_digest_column(obs_expected)

# ---
# 2. Creating individual var components: valid object and valid object with labels

# Valid var per schema
good_var = pd.DataFrame(
    [
        [False],
        [False],
        [False],
        [False],
    ],
    index=["ERCC-00002", "ENSG00000127603", "ENSMUSG00000059552", "ENSSASG00005000004"],
    columns=["feature_is_filtered"],
)

# Expected var, this is what the var above should look like after adding the necessary columns with the validator,
# these columns are defined in the schema
var_expected = pd.DataFrame(
    [
        ["spike-in", False, "ERCC-00002 (spike-in control)", "NCBITaxon:32630", 0],
        ["gene", False, "MACF1", "NCBITaxon:9606", 42738],
        ["gene", False, "Trp53", "NCBITaxon:10090", 4045],
        ["gene", False, "S", "NCBITaxon:2697049", 3822],
    ],
    index=["ERCC-00002", "ENSG00000127603", "ENSMUSG00000059552", "ENSSASG00005000004"],
    columns=[
        "feature_biotype",
        "feature_is_filtered",
        "feature_name",
        "feature_reference",
        "feature_length",
    ],
)

# ---
# 3. Creating individual uns component
good_uns = {
    "title": "A title",
    "default_embedding": "X_umap",
    "X_approximate_distribution": "normal",
    "batch_condition": ["is_primary_data"],
    "donor_id_colors": ["black", "pink"],
    "suspension_type_colors": ["red", "#000000"],
    "tissue_type_colors": ["blue", "#ffffff"],
}

good_uns_with_labels = {
    "schema_version": "4.0.0",
    "schema_reference": "https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/4.0.0/schema.md",
    "title": "A title",
    "default_embedding": "X_umap",
    "X_approximate_distribution": "normal",
    "batch_condition": ["is_primary_data"],
    "donor_id_colors": ["black", "pink"],
    "suspension_type_colors": ["red", "#000000"],
    "tissue_type_colors": ["blue", "#ffffff"],
}

# ---
# 4. Creating expression matrix,
# X has integer values and non_raw_X has real values
X = numpy.zeros([good_obs.shape[0], good_var.shape[0]], dtype=numpy.float32)
non_raw_X = sparse.csr_matrix(X.copy())
non_raw_X[0, 0] = 1.5


# ---
# 5. Creating valid obsm
good_obsm = {"X_umap": numpy.zeros([X.shape[0], 2])}


# ---
# 6. Putting all the components created in the previous steps into a minimal anndata that is used for testing in
#   the unittests

# Valid anndata
adata = anndata.AnnData(X=sparse.csr_matrix(X), obs=good_obs, uns=good_uns, obsm=good_obsm, var=good_var)
print(adata)

adata.raw = adata.copy()
adata.X = non_raw_X
adata.raw.var.drop("feature_is_filtered", axis=1, inplace=True)

# Write to new file
modified_path = "cellxgene_schema_cli/tests/fixtures/h5ads/example_valid_modified.h5ad"
adata.write(modified_path, compression="gzip")

cellxgene_schema_cli/cellxgene_schema/validate.py

cellxgene_schema_cli/tests/test_schema_compliance.py

codecov · 2023-09-25T19:46:03Z

Codecov Report

Merging #625 (80d8069) into main (9b92401) will increase coverage by 0.23%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #625      +/-   ##
==========================================
+ Coverage   83.45%   83.68%   +0.23%     
==========================================
  Files          19       19              
  Lines        1710     1735      +25     
==========================================
+ Hits         1427     1452      +25     
  Misses        283      283

Flag	Coverage Δ
unittests	`83.68% <100.00%> (+0.23%)`	⬆️

Files	Coverage Δ
cellxgene_schema_cli/cellxgene_schema/validate.py	`94.60% <100.00%> (+0.25%)`	⬆️

nayib-jose-gloria · 2023-09-26T18:37:02Z

cellxgene_schema_cli/cellxgene_schema/validate.py

+                    category_mapping[column_name] = df[column_name].nunique()
+
+        for column_name, num_unique_vals in category_mapping.items():
+            colors_options = uns_dict.get(f"{column_name}_colors", [])


We should only run validation against {column_name}_colors if it exists, since a corresponding {column}_colors key in uns is optional. As written, this will fail if a categorical column opts out of {column}_colors.
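
A minimal sketch of the behavior requested in the comment above, assuming the check only needs the `category_mapping` (column name to number of unique values) built in the diff and the `uns` dictionary. The function name `check_colors` and the error message wording are hypothetical and are not taken from validate.py; the "at least as many colors as categories" rule follows the commit message on 80d8069.

```python
from typing import Dict, List


def check_colors(category_mapping: Dict[str, int], uns_dict: dict) -> List[str]:
    """Return error messages for {column}_colors keys that are present but too short.

    Hypothetical helper for illustration; not the validator's actual API.
    """
    errors = []
    for column_name, num_unique_vals in category_mapping.items():
        colors_key = f"{column_name}_colors"
        # The key is optional: skip columns that opt out of custom colors.
        if colors_key not in uns_dict:
            continue
        colors_options = uns_dict[colors_key]
        # Require at least as many colors as there are categories.
        if len(colors_options) < num_unique_vals:
            errors.append(
                f"'{colors_key}' has {len(colors_options)} colors but "
                f"'{column_name}' has {num_unique_vals} categories."
            )
    return errors


# Example: suspension_type opts out, tissue_type_colors is one color short.
category_mapping = {"donor_id": 2, "suspension_type": 2, "tissue_type": 2}
uns_dict = {"donor_id_colors": ["black", "pink"], "tissue_type_colors": ["blue"]}
print(check_colors(category_mapping, uns_dict))
```

In this example only `tissue_type_colors` is reported: `donor_id_colors` has enough entries, and `suspension_type` is skipped because no `suspension_type_colors` key is present.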

joyceyan commented Sep 15, 2023

cellxgene_schema_cli/cellxgene_schema/validate.py (outdated)

atarashansky requested changes Sep 20, 2023

cellxgene_schema_cli/cellxgene_schema/validate.py (outdated)

cellxgene_schema_cli/cellxgene_schema/validate.py (outdated)

joyceyan added 4 commits September 21, 2023 18:00

column colors validation

f60fa65

fix linter

6b2b298

update logic for colors validation

3e73af3

fix linter

2aa18f1

joyceyan changed the title ~~feat: {column}_colors validation~~ feat: {column}_colors validation [wip] Sep 22, 2023

joyceyan added 2 commits September 22, 2023 12:48

fix minor things

b18156d

add test coverage and fix tests that don't use h5ad fixture

ee167df

joyceyan force-pushed the joyce/colors-validation branch from 1e8c5a8 to ee167df September 22, 2023 21:05

joyceyan added 3 commits September 22, 2023 17:06

fix linter

72fadff

rm debug statements

e2736d5

update test fixtures

18dcf8d

joyceyan changed the title ~~feat: {column}_colors validation [wip]~~ feat: {column}_colors validation Sep 25, 2023

joyceyan added 3 commits September 25, 2023 14:14

rm ds store

8958ac0

remove line

7ae0565

rename test

96781f6

atarashansky requested changes Sep 25, 2023

cellxgene_schema_cli/tests/test_schema_compliance.py (outdated)

colors must be greater than or equal to number of categories

80d8069

atarashansky approved these changes Sep 25, 2023

joyceyan merged commit a62fd9d into main Sep 25, 2023
8 checks passed

joyceyan deleted the joyce/colors-validation branch September 25, 2023 20:03

nayib-jose-gloria reviewed Sep 26, 2023

joyceyan mentioned this pull request Sep 26, 2023

fix: only validate colors if the key is specified #647

Merged
