feat: {column}_colors validation #625

Merged
merged 13 commits into from
Sep 25, 2023

joyceyan (Contributor) commented Sep 15, 2023

https://app.zenhub.com/workspaces/single-cell-5e2a191dad828d52cc78b028/issues/gh/chanzuckerberg/single-cell-curation/513

I generated the new example_valid.h5ad by running the following in a Python notebook. Basically, I copied everything in examples_validate.py and wrote the result to an h5ad file.

import pandas as pd
import numpy
import anndata
import os
from scipy import sparse
from cellxgene_schema.utils import get_hash_digest_column
from cellxgene_schema.validate import Validator

# -----------------------------------------------------------------#
# Manually creating minimal anndata objects.
#
# The valid objects mentioned below contain all valid cases covered in the schema, including multiple examples for
# fields that allow multiple valid options.
#
# This process entails:
# 1. Creating individual obs components: one valid dataframe, and one with labels (extra columns that are supposed
#   to be added by validator)
# 2. Creating individual var components: valid, and one with labels
# 3. Creating individual uns valid component
# 4. Creating expression matrices
# 5. Creating valid obsm
# 6. Putting all the components created in the previous steps into a minimal anndata that is used for testing in
#   the unit tests

# Valid obs per schema
good_obs = pd.DataFrame(
    [
        [
            "CL:0000066",
            "EFO:0009899",
            "MONDO:0100096",
            "NCBITaxon:9606",
            "PATO:0000383",
            "UBERON:0002048",
            "tissue",
            True,
            "HANCESTRO:0575",
            "HsapDv:0000003",
            "donor_1",
            "nucleus",
        ],
        [
            "CL:0000192",
            "EFO:0009918",
            "PATO:0000461",
            "NCBITaxon:10090",
            "unknown",
            "CL:0000192",
            "cell culture",
            False,
            "na",
            "MmusDv:0000003",
            "donor_2",
            "na",
        ],
    ],
    index=["X", "Y"],
    columns=[
        "cell_type_ontology_term_id",
        "assay_ontology_term_id",
        "disease_ontology_term_id",
        "organism_ontology_term_id",
        "sex_ontology_term_id",
        "tissue_ontology_term_id",
        "tissue_type",
        "is_primary_data",
        "self_reported_ethnicity_ontology_term_id",
        "development_stage_ontology_term_id",
        "donor_id",
        "suspension_type",
    ],
)

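# Cast each column to categorical from its own values (assigning the whole-frame
# astype() result through .loc may leave the column dtype unchanged).
good_obs["donor_id"] = good_obs["donor_id"].astype("category")
good_obs["suspension_type"] = good_obs["suspension_type"].astype("category")
good_obs["tissue_type"] = good_obs["tissue_type"].astype("category")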

# Expected obs, this is what the obs above should look like after adding the necessary columns with the validator,
# these columns are defined in the schema
obs_expected = pd.DataFrame(
    [
        [
            "epithelial cell",
            "10x 3' v2",
            "COVID-19",
            "Homo sapiens",
            "female",
            "lung",
            "Yoruban",
            "Carnegie stage 01",
        ],
        [
            "smooth muscle cell",
            "smFISH",
            "normal",
            "Mus musculus",
            "unknown",
            "smooth muscle cell",
            "na",
            "Theiler stage 01",
        ],
    ],
    index=["X", "Y"],
    columns=[
        "cell_type",
        "assay",
        "disease",
        "organism",
        "sex",
        "tissue",
        "self_reported_ethnicity",
        "development_stage",
    ],
)

obs_expected["observation_joinid"] = get_hash_digest_column(obs_expected)

# ---
# 2. Creating individual var components: valid object and valid object with labels

# Valid var per schema
good_var = pd.DataFrame(
    [
        [False],
        [False],
        [False],
        [False],
    ],
    index=["ERCC-00002", "ENSG00000127603", "ENSMUSG00000059552", "ENSSASG00005000004"],
    columns=["feature_is_filtered"],
)

# Expected var, this is what the var above should look like after adding the necessary columns with the validator,
# these columns are defined in the schema
var_expected = pd.DataFrame(
    [
        ["spike-in", False, "ERCC-00002 (spike-in control)", "NCBITaxon:32630", 0],
        ["gene", False, "MACF1", "NCBITaxon:9606", 42738],
        ["gene", False, "Trp53", "NCBITaxon:10090", 4045],
        ["gene", False, "S", "NCBITaxon:2697049", 3822],
    ],
    index=["ERCC-00002", "ENSG00000127603", "ENSMUSG00000059552", "ENSSASG00005000004"],
    columns=[
        "feature_biotype",
        "feature_is_filtered",
        "feature_name",
        "feature_reference",
        "feature_length",
    ],
)

# ---
# 3. Creating individual uns component
good_uns = {
    "title": "A title",
    "default_embedding": "X_umap",
    "X_approximate_distribution": "normal",
    "batch_condition": ["is_primary_data"],
    "donor_id_colors": ["black", "pink"],
    "suspension_type_colors": ["red", "#000000"],
    "tissue_type_colors": ["blue", "#ffffff"],
}

good_uns_with_labels = {
    "schema_version": "4.0.0",
    "schema_reference": "https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/4.0.0/schema.md",
    "title": "A title",
    "default_embedding": "X_umap",
    "X_approximate_distribution": "normal",
    "batch_condition": ["is_primary_data"],
    "donor_id_colors": ["black", "pink"],
    "suspension_type_colors": ["red", "#000000"],
    "tissue_type_colors": ["blue", "#ffffff"],
}

# ---
# 4. Creating expression matrices
# X has integer values and non_raw_X has real values
X = numpy.zeros([good_obs.shape[0], good_var.shape[0]], dtype=numpy.float32)
non_raw_X = sparse.csr_matrix(X.copy())
non_raw_X[0, 0] = 1.5


# ---
# 5. Creating valid obsm
good_obsm = {"X_umap": numpy.zeros([X.shape[0], 2])}


# ---
# 6. Putting all the components created in the previous steps into a minimal anndata that is used for testing in
#   the unit tests

# Valid anndata
adata = anndata.AnnData(X=sparse.csr_matrix(X), obs=good_obs, uns=good_uns, obsm=good_obsm, var=good_var)
print(adata)

adata.raw = adata.copy()
adata.X = non_raw_X
adata.raw.var.drop("feature_is_filtered", axis=1, inplace=True)

# Write to new file
modified_path = "cellxgene_schema_cli/tests/fixtures/h5ads/example_valid_modified.h5ad"
adata.write(modified_path, compression="gzip")
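
For context on what the `*_colors` entries in `good_uns` are paired with, here is a small standard-library sketch. It assumes the usual convention that the i-th color belongs to the i-th category of the matching obs column, and that categories are the sorted unique values (which is how pandas orders categories it infers from strings). The helper name `pair_colors` is hypothetical and is not part of cellxgene_schema.

```python
from typing import Dict, Sequence


def pair_colors(values: Sequence[str], colors: Sequence[str]) -> Dict[str, str]:
    """Map each category to the color at the same position (hypothetical helper)."""
    # Assumption: categories are the sorted unique values of the column.
    categories = sorted(set(values))
    if len(colors) < len(categories):
        raise ValueError(
            f"need at least {len(categories)} colors, got {len(colors)}"
        )
    # zip() stops at the shorter input, so surplus colors are ignored.
    return dict(zip(categories, colors))


donor_ids = ["donor_1", "donor_2"]
print(pair_colors(donor_ids, ["black", "pink"]))
```

With the `donor_id` values and `donor_id_colors` from the fixture above, `donor_1` pairs with `black` and `donor_2` with `pink`.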

@joyceyan joyceyan changed the title feat: {column}_colors validation feat: {column}_colors validation [wip] Sep 22, 2023
@joyceyan joyceyan force-pushed the joyce/colors-validation branch from 1e8c5a8 to ee167df Compare September 22, 2023 21:05
@joyceyan joyceyan changed the title feat: {column}_colors validation [wip] feat: {column}_colors validation Sep 25, 2023

codecov bot commented Sep 25, 2023

Codecov Report

Merging #625 (80d8069) into main (9b92401) will increase coverage by 0.23%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #625      +/-   ##
==========================================
+ Coverage   83.45%   83.68%   +0.23%     
==========================================
  Files          19       19              
  Lines        1710     1735      +25     
==========================================
+ Hits         1427     1452      +25     
  Misses        283      283              
| Flag | Coverage Δ |
|---|---|
| unittests | 83.68% <100.00%> (+0.23%) ⬆️ |

| Files | Coverage Δ |
|---|---|
| cellxgene_schema_cli/cellxgene_schema/validate.py | 94.60% <100.00%> (+0.25%) ⬆️ |


@joyceyan joyceyan merged commit a62fd9d into main Sep 25, 2023
8 checks passed
@joyceyan joyceyan deleted the joyce/colors-validation branch September 25, 2023 20:03
category_mapping[column_name] = df[column_name].nunique()

for column_name, num_unique_vals in category_mapping.items():
    colors_options = uns_dict.get(f"{column_name}_colors", [])

We should only run validation against {column_name}_colors if it exists, since a corresponding {column}_colors entry in uns is optional. As written, this will fail if a categorical column opts out of {column}_colors.
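
A minimal sketch of the guard suggested here, not the code that was merged: it checks `{column_name}_colors` only when that key is present in `uns`, so a categorical column without a colors entry is skipped. The function name `check_colors`, the error wording, and the rule that the list needs at least as many entries as the column has unique values are assumptions for illustration. `category_mapping` and `uns_dict` mirror the names in the snippet under review.

```python
from typing import Dict, List, Sequence


def check_colors(
    category_mapping: Dict[str, int],
    uns_dict: Dict[str, Sequence[str]],
) -> List[str]:
    """Return error messages for colors entries that exist but are too short.

    Hypothetical helper: name, messages, and the length rule are assumptions.
    """
    errors = []
    for column_name, num_unique_vals in category_mapping.items():
        colors_key = f"{column_name}_colors"
        # The colors entry is optional, so skip columns that opt out of it.
        if colors_key not in uns_dict:
            continue
        colors_options = uns_dict[colors_key]
        if len(colors_options) < num_unique_vals:
            errors.append(
                f"{colors_key} has {len(colors_options)} colors, "
                f"expected at least {num_unique_vals}"
            )
    return errors


category_mapping = {"donor_id": 2, "suspension_type": 2, "tissue_type": 2}
uns_dict = {
    "donor_id_colors": ["black", "pink"],
    "suspension_type_colors": ["red"],  # too short on purpose
    # no tissue_type_colors: allowed, the column simply opts out
}
print(check_colors(category_mapping, uns_dict))
```

In this example only `suspension_type_colors` is reported. `tissue_type` produces no error because it has no colors entry at all.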
