Load complex portal UIDs into Chado and put them in the GPAD file #1166

Closed
ValWood opened this issue Apr 27, 2024 · 22 comments

Comments

@ValWood

ValWood commented Apr 27, 2024

I don't see complex portal IDs in Noctua, so I guess we need to add them to the GPAD. GPI?

v

cc @PCarme

@kimrutherford

SGD complex example: https://www.yeastgenome.org/complex/CPX-2262

Action for Kim: check the SGD GPI/GPAD file for this complex.

@ValWood ValWood changed the title from "Load complex portal UIDs into Cahdo and put them in the GPAD file" to "Load complex portal UIDs into Chado and put them in the GPAD file" on Apr 30, 2024
@kimrutherford

@ValWood How do we associate gene IDs with complex portal IDs? Is there a mapping file?

@kimrutherford

On the call I was wondering what SO term to use in column 5 of the GPI file (DB_Object_Type). I had a look at the GPAD/GPI 2.0 spec and it says:

the entity type in column 5 is captured using an ID from the Sequence Ontology, Protein Ontology, or Gene Ontology

So probably we can use something like protein-containing complex (GO:0032991) as the type for complexes.

check the SGD GPI/GPAD file for this complex.

SGD are still on GPAD/GPI v1.2 which doesn't need a term ID for the object type.

This is the line in the SGD GPI file for that complex:

SGD     S000218145      CPX-2262                26S Proteasome complex|Proteasome Activator|2f16|2gpl|2zcy|3hye|1g0u|3bdm|3dy3|3dy4|3gpw|3gpt|1z7q|3e47|3d29|1jd2|2fak|4v7o|1g65|1ryp|3gpj|3JCK|3.4.25.1|4cr4|2596|4cr3|2595|3JCO|3JCP|4cr2|2594|3.4.19.12        protein_complex taxon:55929

The type is just protein_complex

@kimrutherford

As a first step I've added a build step to load a file with a mapping from gene systematic IDs to Complex Portal IDs.

The file is: pombe-embl/supporting_files/protein_complex_id_mapping.tsv

The three tab separated columns are:

  • gene systematic ID
  • Complex Portal ID
  • PubMed ID (maybe a Complex Portal paper?)

The PubMed ID is required by Chado.

It's currently an empty file.
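
For illustration only, here is a minimal Python sketch of reading a three-column file like the one described above. It is not the actual build step; the function name, the `ComplexMapping` type and the example IDs are all hypothetical placeholders.

```python
import csv
import io
from typing import NamedTuple


class ComplexMapping(NamedTuple):
    gene_systematic_id: str
    complex_portal_id: str
    pubmed_id: str


def read_complex_mapping(handle):
    """Parse three-column TSV rows, skipping blank lines."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or all(not f.strip() for f in fields):
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {fields!r}")
        rows.append(ComplexMapping(*(f.strip() for f in fields)))
    return rows


# Placeholder row: these IDs are made up and are not a real mapping.
example = "SPAC0000.00\tCPX-0000\tPMID:00000000\n"
mappings = read_complex_mapping(io.StringIO(example))
```

An empty file simply yields an empty list, so a loader built this way would be a no-op until rows are added.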

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue May 1, 2024
@kimrutherford

After we have added some complexes to the mapping file and successfully loaded them into Chado, I'll change the GPI writer to include the complex details.

@kimrutherford

So probably we can use something like protein-containing complex (GO:0032991) as the type for complexes.

GO:0032991 is what the GO db-xrefs file says.

@ValWood

ValWood commented May 1, 2024

The type is just protein_complex

We should use a broader term if there is one (to cover protein-RNA complexes).
I can't even find protein_complex in SO?

@ValWood

ValWood commented May 1, 2024

Are you using the GO protein complex ID? If so, use "protein-containing complex (GO:0032991)".

@kimrutherford

I can't even find protein_complex in SO?

The GPI 1.2 spec allows "protein_complex" as a special case:


DB_Object_Type

A description of the type of the gene or gene product being annotated. This field uses Sequence Ontology labels and may correspond to one of the following: gene, protein_complex; protein; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; or any subtype of ncRNA in the Sequence Ontology.

https://geneontology.org/docs/gene-product-information-gpi-format/#db_object_type

@kimrutherford

Are you using the GO protein complex ID? If so, use "protein-containing complex (GO:0032991)".

Yep, that's what I'm using.

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue May 1, 2024
kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue May 1, 2024
@kimrutherford

After we have added some complexes to the mapping file and successfully loaded them into Chado, I'll change the GPI writer to include the complex details.

I added some fake protein complex data to my local test Chado database. So I've now implemented and tested writing the complexes to the GPI file.

The complexes will start appearing in the GPI file once we have some complexes in pombe-embl/supporting_files/protein_complex_id_mapping.tsv

@ValWood

ValWood commented May 1, 2024

https://www.ebi.ac.uk/complexportal/complex/organisms
It might be the "complex tab" file here,
but I don't think the download is working?

@ValWood

ValWood commented May 1, 2024

We do have some real data in the spreadsheet Sandra shared with us
https://docs.google.com/spreadsheets/d/1S4qU55KgNAKLsfXr-4DCb06jKgXcvt3kcAupSNPvl5Y/edit#gid=0

We will need to be careful mapping using gene names: Complex Portal will likely use the UniProt gene names, and sometimes their names are inferred from S. cerevisiae and are not the official names. We probably need to use UniProt identifiers instead in the 'real' conversion.

@ValWood

ValWood commented May 1, 2024

Or, we can use the link from
CPX-555 -> GO:0005955 and then use our genes from
https://www.pombase.org/data/annotations/Gene_ontology/GO_complexes/

@kimrutherford

Or, we can use the link from CPX-555 -> GO:0005955

Where is that link?

@kimrutherford

https://www.ebi.ac.uk/complexportal/complex/organisms
It might be the "complex tab" file here,
but I don't think the download is working?

It didn't work for me either. I found the TSV files here:
http://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab/
http://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab/284812.tsv

@kimrutherford

http://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab/284812.tsv

Annoyingly the gene IDs are UniProt IDs but we can look them up in Chado when loading.

kimrutherford added a commit to pombase/website that referenced this issue May 2, 2024
kimrutherford added a commit that referenced this issue May 3, 2024
Sets feature names from a file of feature uniquenames and names.

Refs #1166
kimrutherford added a commit to pombase/pombase-legacy that referenced this issue May 3, 2024
Loads complex portal IDs and names and the mapping to genes.
Genes are part_of complexes via feature_relationship.

Refs pombase/pombase-chado#1166
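
To make the "genes are part_of complexes via feature_relationship" idea concrete, here is a rough, self-contained Python sketch using an in-memory SQLite database. It assumes a heavily simplified subset of the Chado `feature`, `feature_relationship` and `cvterm` tables (the real schema is PostgreSQL and has many more columns). The helper names, the `protein_containing_complex` feature type label and the IDs are hypothetical and are not taken from the actual loader.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the Chado tables; real Chado has more columns.
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE,
    name TEXT,
    type_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
CREATE TABLE feature_relationship (
    feature_relationship_id INTEGER PRIMARY KEY,
    subject_id INTEGER NOT NULL REFERENCES feature(feature_id),
    object_id INTEGER NOT NULL REFERENCES feature(feature_id),
    type_id INTEGER NOT NULL REFERENCES cvterm(cvterm_id),
    UNIQUE (subject_id, object_id, type_id)
);
""")


def cvterm_id(name):
    """Return the ID of the named term, creating it if needed."""
    row = conn.execute(
        "SELECT cvterm_id FROM cvterm WHERE name = ?", (name,)).fetchone()
    if row:
        return row[0]
    return conn.execute(
        "INSERT INTO cvterm (name) VALUES (?)", (name,)).lastrowid


def feature_id(uniquename, type_name, name=None):
    """Return the ID of the feature, creating it if needed."""
    row = conn.execute(
        "SELECT feature_id FROM feature WHERE uniquename = ?",
        (uniquename,)).fetchone()
    if row:
        return row[0]
    return conn.execute(
        "INSERT INTO feature (uniquename, name, type_id) VALUES (?, ?, ?)",
        (uniquename, name, cvterm_id(type_name))).lastrowid


def add_member(gene_uniquename, complex_uniquename, complex_name=None):
    """Store one gene -part_of-> complex row; exact duplicates are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO feature_relationship"
        " (subject_id, object_id, type_id) VALUES (?, ?, ?)",
        (feature_id(gene_uniquename, "gene"),
         # Hypothetical type label for the complex feature.
         feature_id(complex_uniquename, "protein_containing_complex",
                    complex_name),
         cvterm_id("part_of")))


def members_of(complex_uniquename):
    """List the gene uniquenames that are part_of the given complex."""
    rows = conn.execute("""
        SELECT gene.uniquename
          FROM feature_relationship rel
          JOIN feature gene ON gene.feature_id = rel.subject_id
          JOIN feature cplx ON cplx.feature_id = rel.object_id
          JOIN cvterm rel_type ON rel_type.cvterm_id = rel.type_id
         WHERE cplx.uniquename = ? AND rel_type.name = 'part_of'
         ORDER BY gene.uniquename
    """, (complex_uniquename,))
    return [r[0] for r in rows]


# Placeholder IDs, not real genes or complexes.
for gene in ("GENE_A", "GENE_B", "GENE_A"):
    add_member(gene, "CPX-0000", "example complex")
members = members_of("CPX-0000")
```

The point of the sketch is that each member gene gets its own `feature_relationship` row, so a complex with several subunits ends up with several rows pointing at the same complex feature.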
@kimrutherford

The complexes will start appearing in the GPI file once we have some complexes in pombe-embl/supporting_files/protein_complex_id_mapping.tsv

New plan: we now download the data file from Complex Portal when it changes, then load the details into Chado.

I'll check in the morning that it's all OK. We should have the complex IDs in the GPI from tomorrow.

Annoyingly the gene IDs are UniProt IDs but we can look them up in Chado when loading

That is handled by the load script.
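
As a rough illustration of that lookup step (not the actual load script), the Python sketch below reads a complextab-style TSV and maps member accessions to gene systematic IDs. The column headers and the `ACCESSION(stoichiometry)` member format are assumptions about the complextab layout and should be checked against the downloaded file. A plain dict stands in for the Chado lookup, and all IDs are placeholders.

```python
import csv
import io
import re

# Assumed complextab column headers; verify against the real file.
COMPLEX_AC = "#Complex ac"
COMPLEX_NAME = "Recommended name"
MEMBERS = "Identifiers (and stoichiometry) of molecules in complex"


def parse_members(cell):
    """Split 'P00001(1)|P00002(2)' into accessions, dropping stoichiometry."""
    return [re.sub(r"\(\d+\)$", "", part.strip())
            for part in cell.split("|") if part.strip()]


def gene_complex_pairs(handle, uniprot_to_gene):
    """Return (gene ID, complex ID, complex name) tuples plus unmapped members.

    Members with no entry in the lookup (for example small molecules or
    accessions missing from the database) are collected rather than dropped
    silently, so they can be reported.
    """
    pairs = []
    unmapped = set()
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for accession in parse_members(row[MEMBERS]):
            gene = uniprot_to_gene.get(accession)
            if gene is None:
                unmapped.add(accession)
            else:
                pairs.append((gene, row[COMPLEX_AC], row[COMPLEX_NAME]))
    return pairs, unmapped


# Placeholder data: none of these IDs are real.
sample = (
    "#Complex ac\tRecommended name\t"
    "Identifiers (and stoichiometry) of molecules in complex\n"
    "CPX-0000\texample complex\tP00001(1)|P00002(2)|CHEBI:00000(0)\n"
)
lookup = {"P00001": "GENE_A", "P00002": "GENE_B"}
pairs, unmapped = gene_complex_pairs(io.StringIO(sample), lookup)
```

Keeping the unmapped accessions separate makes it easier to spot the naming problems mentioned earlier in the thread, since nothing depends on gene names.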

kimrutherford added a commit that referenced this issue May 3, 2024
We were only storing one feature_relationship per complex when loading
pombe_to_complex_id_mapping.tsv

Refs #1166
@kimrutherford

I'll check in the morning that it's all OK. We should have the complex IDs in the GPI from tomorrow.

The GPI has the complexes now. But I think I missed the Protein_Containing_Complex_Members field (column 9):
https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md

I'll change the GPI writer to fill this in.

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue May 6, 2024
Column 9: "Protein_Containing_Complex_Members"

Refs pombase/pombase-chado#1166
@kimrutherford

But I think I missed the Protein_Containing_Complex_Members field (column 9):
I'll change the GPI writer to fill this in.

I've done that in time for the nightly load.

The example in the GPI docs shows UniProt IDs but the spec implies that any ID will do. I've used PomBase gene IDs for now. I'll change it if there's a problem.
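
For anyone following along, this is roughly what such a row could look like, sketched in Python rather than taken from the actual GPI writer. Columns 5 (type) and 9 (Protein_Containing_Complex_Members) are as discussed in this thread. The remaining column positions follow my reading of the GPAD/GPI 2.0 spec and should be checked against it. The `ComplexPortal` and `PomBase` prefixes, the function name and the example IDs are assumptions; the taxon ID 284812 is the one in the complextab file name above.

```python
GO_PROTEIN_CONTAINING_COMPLEX = "GO:0032991"


def complex_gpi_row(complex_id, name, member_gene_ids, taxon_id,
                    db_prefix="ComplexPortal", member_prefix="PomBase"):
    """Build one tab-separated, 11-column GPI 2.0 style row for a complex."""
    members = "|".join(
        f"{member_prefix}:{gene_id}" for gene_id in sorted(member_gene_ids))
    columns = [
        f"{db_prefix}:{complex_id}",     # 1  DB_Object_ID
        complex_id,                      # 2  DB_Object_Symbol
        name,                            # 3  DB_Object_Name
        "",                              # 4  DB_Object_Synonyms
        GO_PROTEIN_CONTAINING_COMPLEX,   # 5  DB_Object_Type
        f"NCBITaxon:{taxon_id}",         # 6  DB_Object_Taxon
        "",                              # 7  Encoded_By
        "",                              # 8  Parent_Protein
        members,                         # 9  Protein_Containing_Complex_Members
        "",                              # 10 DB_Xrefs
        "",                              # 11 Gene_Product_Properties
    ]
    return "\t".join(columns)


# Placeholder complex and gene IDs, not real data.
row = complex_gpi_row("CPX-0000", "example complex",
                      ["GENE_B", "GENE_A"], 284812)
fields = row.split("\t")
```

Switching column 9 to UniProt IDs would then only mean passing different member IDs and a different `member_prefix`.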

@kimrutherford

The example in the GPI docs shows UniProt IDs but the spec implies that any ID will do. I've used PomBase gene IDs for now. I'll change it if there's a problem.

We should check this with the Noctua people. Perhaps we need to use UniProt IDs in field 9 ("Protein_Containing_Complex_Members") of the GPI file?

@kimrutherford

We should check this with the Noctua people.

I've commented here:
