feat: add feature length field, annotated by add-labels function #624

nayib-jose-gloria · 2023-09-15T18:41:02Z

Changes:

annotate feature_length field in add-labels function, based on feature_id and pre-computed gene info CSV files (derived + committed to repo during ontology bumps). Set to 0 for non-gene features (even though a length is calculated + available for spike-in controls)
refactor gene ontology tool that fetches info from gene CSV files to accommodate fetching calculated feature_length
update tests
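
The annotation rule described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `GENE_LENGTHS` and `feature_length_mapping` are hypothetical names, the IDs and lengths are made-up placeholders, and the real implementation looks lengths up through the gene ontology tool backed by the committed CSV files. Only the `startswith("ENS")` check is taken from the diff.

```python
from typing import Dict, List

# Hypothetical stand-in for the pre-computed gene info CSVs; values are made up.
GENE_LENGTHS: Dict[str, int] = {
    "ENSG00000000001": 1200,
    "ENSMUSG00000000002": 850,
}


def feature_length_mapping(ids: List[str]) -> Dict[str, int]:
    """Map each feature ID to its length, using 0 for non-gene features."""
    mapping: Dict[str, int] = {}
    for feature_id in ids:
        if feature_id.startswith("ENS"):
            # Gene feature: use the pre-computed length (KeyError if unknown).
            mapping[feature_id] = GENE_LENGTHS[feature_id]
        else:
            # Non-gene feature, e.g. a spike-in control: length is set to 0.
            mapping[feature_id] = 0
    return mapping
```

Calling `feature_length_mapping(["ENSG00000000001", "ERCC-00002"])` would give the gene its placeholder length and the spike-in control a length of 0.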

codecov · 2023-09-15T19:21:58Z

Codecov Report

Merging #624 (e0d430e) into main (ccad8d9) will increase coverage by 0.14%.
Report is 2 commits behind head on main.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #624      +/-   ##
==========================================
+ Coverage   83.02%   83.17%   +0.14%     
==========================================
  Files          19       19              
  Lines        1703     1718      +15     
==========================================
+ Hits         1414     1429      +15     
  Misses        289      289

Flag	Coverage Δ
unittests	`83.17% <100.00%> (+0.14%)`	⬆️

Flags with carried forward coverage won't be shown.

Files Changed	Coverage Δ
cellxgene_schema_cli/cellxgene_schema/ontology.py	`94.44% <100.00%> (+0.32%)`	⬆️
...lxgene_schema_cli/cellxgene_schema/write_labels.py	`94.51% <100.00%> (+0.35%)`	⬆️


danieljhegeman · 2023-09-15T23:30:14Z

for spike-in proteins

I can't believe I'm asking this but what is a "spike-in protein"? From Google:

Spike-in controls are synthetic nucleic-acid sequences that are added to a user's sample and constitute internal standards for subsequent steps in the next generation sequencing workflow

cellxgene_schema_cli/cellxgene_schema/ontology.py

nayib-jose-gloria · 2023-09-16T15:00:17Z

for spike-in proteins

I can't believe I'm asking this but what is a "spike-in protein"? From Google:

Spike-in controls are synthetic nucleic-acid sequences that are added to a user's sample and constitute internal standards for subsequent steps in the next generation sequencing workflow

oops--I did mean spike-in control, I have no idea why I wrote protein 😅 sorry!

danieljhegeman · 2023-09-18T18:53:14Z

@nayib-jose-gloria tangentially-related: would you mind cleaning up this method description, i.e. turn it into parseable English 🙏

single-cell-curation/cellxgene_schema_cli/cellxgene_schema/write_labels.py

Lines 324 to 325 in f6bb232

    
        From a valid (per cellxgene's schema) adata, this function adds to self.adata ontology/gene labels
        to adata.obs, adata.var, and adata.raw.var respectively

Bento007 · 2023-09-18T21:04:41Z

cellxgene_schema_cli/cellxgene_schema/ontology.py

-        return self.gene_dict[gene_id]
+        return self.gene_dict[gene_id][0]
+
+    def get_length(self, gene_id) -> int:


Add a type annotation for gene_id.

Bento007 · 2023-09-18T21:08:03Z

cellxgene_schema_cli/cellxgene_schema/write_labels.py

+        mapping_dict = {}
+
+        for i in ids:
+            if i.startswith("ENS"):


Should we start putting hardcoded strings into a constant.py file? Not sure we have enough cases for it to make sense.

Might be a good thing to do at the end of the epic? Or we can start now

I don't think it's necessary at this point, we can revisit if we find that it's coming up frequently and causing confusion.

Bento007 · 2023-09-18T21:10:34Z

cellxgene_schema_cli/tests/test_schema_compliance.py

+        for df in ["var", "raw.var"]:
+            for i in self.validator.schema_def["components"][df]["index"]["add_labels"]:
+                column = i["to_column"]
+                with self.subTest(column=column, df=df):


use https://docs.pytest.org/en/7.3.x/how-to/parametrize.html

(repeat from another PR) let's sync on this today. This whole module of schema compliance tests uses unittest libraries and syntax, so I think switching over to pytest is out of scope for this ticket and the other open issues. But we can discuss whether it's worth ticketing separately to swap everything over.

https://app.zenhub.com/workspaces/single-cell-5e2a191dad828d52cc78b028/issues/gh/chanzuckerberg/single-cell-curation/633 tracking a separate issue for this
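
For reference, the pytest pattern being suggested looks roughly like the sketch below. It rests on assumptions: `ADD_LABELS` is a hypothetical stand-in for the entries read from `schema_def`, the column names are illustrative, and the test body is a placeholder rather than the real compliance check.

```python
import pytest

# Hypothetical stand-in for schema_def["components"][df]["index"]["add_labels"].
ADD_LABELS = {
    "var": [{"to_column": "feature_name"}, {"to_column": "feature_length"}],
    "raw.var": [{"to_column": "feature_name"}, {"to_column": "feature_length"}],
}

# Flatten the nested loops into one list of (df, column) cases.
CASES = [(df, label["to_column"]) for df, labels in ADD_LABELS.items() for label in labels]


@pytest.mark.parametrize("df,column", CASES)
def test_reserved_column(df, column):
    # Each (df, column) pair is collected as its own test case,
    # which replaces the nested for loops and self.subTest(...).
    assert df in ADD_LABELS
    assert column in {label["to_column"] for label in ADD_LABELS[df]}
```

Because every case is a separate test, a failure in one `(df, column)` pair is reported independently of the others.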

Bento007 · 2023-09-18T21:18:39Z

cellxgene_schema_cli/tests/test_schema_compliance.py

+                column = i["to_column"]
+                with self.subTest(column=column, df=df):
+                    # Resetting validator
+                    self.validator.adata = examples.adata.copy()


I believe these 2 lines can be removed if you switch to a pytest approach.

danieljhegeman · 2023-09-18T21:37:21Z

cellxgene_schema_cli/cellxgene_schema/ontology.py

+        :return A gene length
+        """
+
+        if not self.is_valid_id(gene_id):


@Bento007 has encouraged me to avoid negatives a la if not... where possible

Could turn the last few lines of this method into a one-liner:
return self.gene_dict[gene_id][1] if self.is_valid_id(gene_id) else raise ValueError(f"The id '{gene_id}' is not a valid ENSEMBL id for '{self.species}'")

🙂

I'll update to avoid the negative, but I personally find one-liners harder to read
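
A note on the suggested one-liner: `raise` is a statement in Python, so it cannot be used as a branch of a conditional expression, and the line as written would be a syntax error. A positive-condition version that stays readable could look like the sketch below. The class here is a simplified, hypothetical stand-in, and the `(name, length)` tuple layout of `gene_dict` is an assumption inferred from the `[0]` and `[1]` indexing in the diff.

```python
from typing import Dict, Tuple


class GeneChecker:
    """Simplified, hypothetical stand-in for the class in ontology.py."""

    def __init__(self, species: str, gene_dict: Dict[str, Tuple[str, int]]):
        self.species = species
        # Assumed layout: gene ID -> (gene name, gene length).
        self.gene_dict = gene_dict

    def is_valid_id(self, gene_id: str) -> bool:
        return gene_id in self.gene_dict

    def get_length(self, gene_id: str) -> int:
        # Positive condition first; the error path follows as a plain statement.
        if self.is_valid_id(gene_id):
            return self.gene_dict[gene_id][1]
        raise ValueError(f"The id '{gene_id}' is not a valid ENSEMBL id for '{self.species}'")
```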

danieljhegeman · 2023-09-18T21:44:07Z

cellxgene_schema_cli/tests/test_validate.py

+        self.assertEqual(self.writer._get_mapping_dict_feature_length(ids), expected_dict)
+
+        # Bad
+        ids = ["NO_GENE"]


For all of the test_get_dictionary_* tests in TestAddLabelFunctions, you may wish to consider factoring out the common dict comparison logic and the initial ids variables. You could create a helper function or simply group them all under a single test. Not required though.
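
One way the helper-function option could look, as a sketch only: `lookup_names` is a hypothetical stand-in for one of the writer's `_get_mapping_dict_*` methods, the data is made up, and raising `KeyError` for a bad ID is an assumption of this example rather than the behavior of the real methods.

```python
import unittest
from typing import Callable, Dict, List


def lookup_names(ids: List[str]) -> Dict[str, str]:
    """Hypothetical stand-in for a writer._get_mapping_dict_* method."""
    names = {"ENSG00000000001": "GENE1"}  # made-up data
    return {i: names[i] for i in ids}  # KeyError for unknown IDs (assumed)


class TestAddLabelFunctions(unittest.TestCase):
    def assert_mapping(
        self,
        get_mapping: Callable[[List[str]], dict],
        good_ids: List[str],
        expected: dict,
        bad_ids: List[str],
        error: type,
    ) -> None:
        # Shared good/bad comparison reused by every mapping test.
        self.assertEqual(get_mapping(good_ids), expected)
        with self.assertRaises(error):
            get_mapping(bad_ids)

    def test_get_dictionary_names(self):
        self.assert_mapping(
            lookup_names,
            good_ids=["ENSG00000000001"],
            expected={"ENSG00000000001": "GENE1"},
            bad_ids=["NO_GENE"],
            error=KeyError,
        )
```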

feat: add feature length field, annotated by add-labels function

f6bb232

nayib-jose-gloria requested review from Bento007 and danieljhegeman September 15, 2023 18:41

nayib-jose-gloria added 2 commits September 15, 2023 15:17

add more tests

78c558f

new line

fb60f0b

danieljhegeman reviewed Sep 15, 2023


danieljhegeman self-requested a review September 18, 2023 18:54

nayib-jose-gloria added 2 commits September 18, 2023 15:01

Update docstrings

5cea47c

more descriptive docstring for add_labels

3d8cb5c

Bento007 requested changes Sep 18, 2023


danieljhegeman reviewed Sep 18, 2023


danieljhegeman approved these changes Sep 18, 2023


add types + small if condition refactor

e0d430e

nayib-jose-gloria requested a review from Bento007 September 19, 2023 15:06

Bento007 approved these changes Sep 19, 2023


nayib-jose-gloria merged commit 1f482ae into main Sep 19, 2023
8 checks passed

nayib-jose-gloria deleted the nayib/add-feature-length branch September 19, 2023 19:10
