Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add feature length field, annotated by add-labels function #624

Merged
merged 6 commits into from
Sep 19, 2023

Conversation

nayib-jose-gloria
Copy link
Contributor

@nayib-jose-gloria nayib-jose-gloria commented Sep 15, 2023

#516

Changes:

  • annotate feature_length field in add-labels function, based on feature_id and pre-computed gene info CSV files (derived + committed to repo during ontology bumps). Set to 0 for non-gene features (even though a length is calculated + available for spike-in controls)
  • refactor gene ontology tool that fetches info from gene CSV files to accommodate fetching calculated feature_length
  • update tests

@codecov
Copy link

codecov bot commented Sep 15, 2023

Codecov Report

Merging #624 (e0d430e) into main (ccad8d9) will increase coverage by 0.14%.
Report is 2 commits behind head on main.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #624      +/-   ##
==========================================
+ Coverage   83.02%   83.17%   +0.14%     
==========================================
  Files          19       19              
  Lines        1703     1718      +15     
==========================================
+ Hits         1414     1429      +15     
  Misses        289      289              
Flag Coverage Δ
unittests 83.17% <100.00%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
cellxgene_schema_cli/cellxgene_schema/ontology.py 94.44% <100.00%> (+0.32%) ⬆️
...lxgene_schema_cli/cellxgene_schema/write_labels.py 94.51% <100.00%> (+0.35%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@danieljhegeman
Copy link

for spike-in proteins

I can't believe I'm asking this but what is a "spike-in protein"? From Google:

Spike-in controls are synthetic nucleic-acid sequences that are added to a user's sample and constitute internal standards for subsequent steps in the next generation sequencing workflow

@nayib-jose-gloria
Copy link
Contributor Author

for spike-in proteins

I can't believe I'm asking this but what is a "spike-in protein"? From Google:

Spike-in controls are synthetic nucleic-acid sequences that are added to a user's sample and constitute internal standards for subsequent steps in the next generation sequencing workflow

oops--I did mean spike-in control, I have no idea why I wrote protein 😅 sorry!

@danieljhegeman
Copy link

danieljhegeman commented Sep 18, 2023

@nayib-jose-gloria tangentially-related: would you mind cleaning up this method description, i.e. turn it into parseable English 🙏

From a valid (per cellxgene's schema) adata, this function adds to self.adata ontology/gene labels
to adata.obs, adata.var, and adata.raw.var respectively

@danieljhegeman danieljhegeman self-requested a review September 18, 2023 18:54
return self.gene_dict[gene_id]
return self.gene_dict[gene_id][0]

def get_length(self, gene_id) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type gene_id

mapping_dict = {}

for i in ids:
if i.startswith("ENS"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we start putting hardcoded string into a constant.py file? Not sure we have enough cases for it to make sense.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a good thing to do at the end of the epic? Or we can start now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary at this point, we can revisit if we find that it's coming up frequently and causing confusion.

for df in ["var", "raw.var"]:
for i in self.validator.schema_def["components"][df]["index"]["add_labels"]:
column = i["to_column"]
with self.subTest(column=column, df=df):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(repeat from another PR) let's sync on this today--this whole module of schema compliance tests uses unittest libraries / syntax, so switching over to pytest I think is out of scope for this ticket + and the other open issues. But we can discuss if it's worth ticketing separately to swap everything over.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

column = i["to_column"]
with self.subTest(column=column, df=df):
# Resetting validator
self.validator.adata = examples.adata.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these 2 lines can be removed if you switch to a pytest approach.

:return A gene length
"""

if not self.is_valid_id(gene_id):
Copy link

@danieljhegeman danieljhegeman Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bento007 has encouraged me to avoid negatives a la if not... where possible

Could turn the last few lines of this method into a one-liner:
return self.gene_dict[gene_id][1] if self.is_valid_id(gene_id) else raise ValueError(f"The id '{gene_id}' is not a valid ENSEMBL id for '{self.species}'")

🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update to avoid the negative, but I personally find one-liners harder to read

self.assertEqual(self.writer._get_mapping_dict_feature_length(ids), expected_dict)

# Bad
ids = ["NO_GENE"]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all of the test_get_dictionary_* tests in TestAddLabelFunctions, you wish to consider factoring out common dict comparison logic and ids initial variables. Could create a helper function or could simply group all under a single test. Not required though.

@nayib-jose-gloria nayib-jose-gloria merged commit 1f482ae into main Sep 19, 2023
8 checks passed
@nayib-jose-gloria nayib-jose-gloria deleted the nayib/add-feature-length branch September 19, 2023 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants