Load complex portal UIDs into Chado and put them in the GPAD file #1166

ValWood · 2024-04-27T08:43:44Z

I don't see Complex Portal IDs in Noctua, so I guess we need to add them to the ~~GPAD~~ GPI?

v

cc @PCarme

kimrutherford · 2024-04-30T10:25:45Z

SGD complex example: https://www.yeastgenome.org/complex/CPX-2262

Action for Kim: check the SGD GPI/GPAD file for this complex.

kimrutherford · 2024-04-30T21:26:15Z

@ValWood How do we associate gene IDs with complex portal IDs? Is there a mapping file?

kimrutherford · 2024-04-30T22:21:32Z

On the call I was wondering what SO term to use in column 5 of the GPI file (DB_Object_Type). I had a look at the GPAD/GPI 2.0 spec and it says:

> the entity type in column 5 is captured using an ID from the Sequence Ontology, Protein Ontology, or Gene Ontology

So probably we can use something like protein-containing complex (GO:0032991) as the type for complexes.

> check the SGD GPI/GPAD file for this complex.

SGD are still on GPAD/GPI v1.2 which doesn't need a term ID for the object type.

This is the line in the SGD GPI file for that complex:

SGD     S000218145      CPX-2262                26S Proteasome complex|Proteasome Activator|2f16|2gpl|2zcy|3hye|1g0u|3bdm|3dy3|3dy4|3gpw|3gpt|1z7q|3e47|3d29|1jd2|2fak|4v7o|1g65|1ryp|3gpj|3JCK|3.4.25.1|4cr4|2596|4cr3|2595|3JCO|3JCP|4cr2|2594|3.4.19.12        protein_complex taxon:55929

The type is just protein_complex

kimrutherford · 2024-05-01T04:31:05Z

As a first step I've added a build step to load a file with a mapping from gene systematic IDs to Complex Portal IDs.

The file is: pombe-embl/supporting_files/protein_complex_id_mapping.tsv

The three tab separated columns are:

gene systematic ID
Complex Portal ID
PubMed ID (maybe a Complex Portal paper?)

The PubMed ID is required by Chado.

It's currently an empty file.
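For illustration, a minimal sketch of reading a file with the three-column layout described above. This is not the actual build step; the function name, the `ComplexMapping` type, and the example IDs are all hypothetical placeholders.

```python
import csv
import io
from typing import NamedTuple


class ComplexMapping(NamedTuple):
    gene_id: str     # gene systematic ID (column 1)
    complex_id: str  # Complex Portal ID (column 2)
    pmid: str        # PubMed ID (column 3), required for the Chado load


def read_complex_mapping(handle):
    """Yield one ComplexMapping per non-blank line of a three-column TSV."""
    reader = csv.reader(handle, delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        if not row or all(not field.strip() for field in row):
            continue  # blank line; an empty file simply yields nothing
        if len(row) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 columns, got {len(row)}"
            )
        yield ComplexMapping(*(field.strip() for field in row))


# Placeholder row only; these are not real identifiers.
example = io.StringIO("SPAC0000.01\tCPX-0000\tPMID:00000000\n")
mappings = list(read_complex_mapping(example))
```

An empty mapping file, as it is at this point in the thread, produces no records and no error.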

kimrutherford · 2024-05-01T05:02:09Z

After we have added some complexes to the mapping file and successfully loaded them into Chado, I'll change the GPI writer to include the complex details.

kimrutherford · 2024-05-01T08:29:54Z

> So probably we can use something like protein-containing complex (GO:0032991) as the type for complexes.

GO:0032991 is what the GO db-xrefs file says.

ValWood · 2024-05-01T08:36:14Z

> The type is just protein_complex

We should use a broader term if there is one (to cover protein-RNA complexes).
I can't even find protein_complex in SO?

ValWood · 2024-05-01T08:37:26Z

Are you using the GO protein complex ID? If so, use "protein-containing complex (GO:0032991)".

kimrutherford · 2024-05-01T08:43:47Z

> I can't even find protein_complex in SO?

The GPI 1.2 spec allows "protein_complex" as a special case:

> DB_Object_Type
>
> A description of the type of the gene or gene product being annotated. This field uses Sequence Ontology labels and may correspond to one of the following: gene, protein_complex; protein; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; or any subtype of ncRNA in the Sequence Ontology.

https://geneontology.org/docs/gene-product-information-gpi-format/#db_object_type

kimrutherford · 2024-05-01T08:44:28Z

> Are you using the GO protein complex ID? If so, use "protein-containing complex (GO:0032991)".

Yep, that's what I'm using.

kimrutherford · 2024-05-01T09:06:24Z

> After we have added some complexes to the mapping file and successfully loaded them into Chado, I'll change the GPI writer to include the complex details.

I added some fake protein complex data to my local test Chado database. So I've now implemented and tested writing the complexes to the GPI file.

The complexes will start appearing in the GPI file once we have some complexes in pombe-embl/supporting_files/protein_complex_id_mapping.tsv

ValWood · 2024-05-01T09:14:03Z

https://www.ebi.ac.uk/complexportal/complex/organisms
It might be the "complex tab" file here,
but I don't think the download is working?

ValWood · 2024-05-01T09:19:26Z

We do have some real data in the spreadsheet Sandra shared with us
https://docs.google.com/spreadsheets/d/1S4qU55KgNAKLsfXr-4DCb06jKgXcvt3kcAupSNPvl5Y/edit#gid=0

We will need to be careful when mapping by gene name: Complex Portal will likely use the UniProt gene names, and sometimes those names are inferred from S. cerevisiae and are not the official names. We probably need to use UniProt identifiers instead in the 'real' conversion.

ValWood · 2024-05-01T09:21:03Z

Or, we can use the link from
CPX-555 -> GO:0005955 and then use our genes from
https://www.pombase.org/data/annotations/Gene_ontology/GO_complexes/

kimrutherford · 2024-05-01T09:22:45Z

> Or, we can use the link from CPX-555 -> GO:0005955

Where is that link?

kimrutherford · 2024-05-01T09:23:01Z

> https://www.ebi.ac.uk/complexportal/complex/organisms
> It might be the "complex tab" file here,
> but I don't think the download is working?

It didn't work for me either. I found the TSV files here:
http://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab/
http://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab/284812.tsv

kimrutherford · 2024-05-02T04:57:41Z

> http://ftp.ebi.ac.uk/pub/databases/intact/complex/current/complextab/284812.tsv

Annoyingly the gene IDs are UniProt IDs but we can look them up in Chado when loading.
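A rough sketch of that lookup, not the real load script. It assumes the participants field looks like `ACCESSION(stoichiometry)` entries joined by `|`, and that a UniProt-to-gene dictionary has already been built from Chado. The function names and all IDs below are hypothetical placeholders.

```python
import re
from collections import defaultdict

# Assumed participant syntax: "ACCESSION(stoichiometry)", joined by "|".
PARTICIPANT_RE = re.compile(r"^(?P<acc>[^()]+)\((?P<count>\d+)\)$")


def parse_participants(field):
    """Return the accessions from a participants field, dropping stoichiometry."""
    accessions = []
    for part in field.split("|"):
        part = part.strip()
        if not part:
            continue
        match = PARTICIPANT_RE.match(part)
        # Keep entries that do not fit the assumed pattern unchanged.
        accessions.append(match.group("acc") if match else part)
    return accessions


def members_by_complex(rows, uniprot_to_gene):
    """Map each complex ID to gene IDs; collect unresolved accessions separately."""
    members = defaultdict(list)
    unresolved = defaultdict(list)
    for complex_id, participants in rows:
        for accession in parse_participants(participants):
            gene_id = uniprot_to_gene.get(accession)
            if gene_id is None:
                unresolved[complex_id].append(accession)
            else:
                members[complex_id].append(gene_id)
    return dict(members), dict(unresolved)


# Placeholder data only; none of these are real identifiers.
rows = [("CPX-0000", "UP00001(1)|UP00002(2)|UP00003(0)")]
lookup = {"UP00001": "SPAC0000.01", "UP00002": "SPBC0000.02"}
members, unresolved = members_by_complex(rows, lookup)
```

Collecting every member in a list per complex, and keeping the unresolved accessions rather than dropping them silently, makes it easier to spot participants that have no matching gene in Chado.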

kimrutherford · 2024-05-03T11:13:33Z

> The complexes will start appearing in the GPI file once we have some complexes in pombe-embl/supporting_files/protein_complex_id_mapping.tsv

New plan: we now download the data file from Complex Portal when it changes, then load the details into Chado.
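The thread does not say how a change in the downloaded file is detected. One possible approach, sketched here purely as an assumption, is to compare a checksum of the fresh download with the one recorded on the previous run. The `file_changed` helper and the marker file name are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_changed(data: bytes, checksum_path: Path) -> bool:
    """Return True, and record the new checksum, if data differs from the last run."""
    new_checksum = hashlib.sha256(data).hexdigest()
    old_checksum = (
        checksum_path.read_text().strip() if checksum_path.exists() else None
    )
    if new_checksum == old_checksum:
        return False
    checksum_path.write_text(new_checksum + "\n")
    return True


# Usage with stand-in bytes instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / "284812.tsv.sha256"
    first = file_changed(b"version 1", marker)   # no earlier checksum recorded
    second = file_changed(b"version 1", marker)  # same content as before
    third = file_changed(b"version 2", marker)   # content differs
```

Only when the function returns True would the Chado load need to run again.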

I'll check in the morning that it's all OK. We should have the complex IDs in the GPI from tomorrow.

> Annoyingly the gene IDs are UniProt IDs but we can look them up in Chado when loading

That is handled by the load script.

kimrutherford · 2024-05-04T03:06:27Z

> I'll check in the morning that it's all OK. We should have the complex IDs in the GPI from tomorrow.

The GPI has the complexes now. But I think I missed the Protein_Containing_Complex_Members field (column 9):
https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md

I'll change the GPI writer to fill this in.

kimrutherford · 2024-05-06T22:26:56Z

> But I think I missed the Protein_Containing_Complex_Members field (column 9):
> I'll change the GPI writer to fill this in.

I've done that in time for the nightly load.

The example in the GPI docs shows UniProt IDs but the spec implies that any ID will do. I've used PomBase gene IDs for now. I'll change it if there's a problem.
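To make the column layout concrete, here is a hedged sketch of assembling one GPI 2.0 line for a complex. It is not the actual writer in pombase-chado-json. The eleven-column order and the `NCBITaxon:` prefix follow my reading of the GPAD/GPI 2.0 spec linked above. The `ComplexPortal:` and `PomBase:` prefixes, the use of the complex ID as the symbol, the function name, and every example value are assumptions.

```python
COMPLEX_TYPE = "GO:0032991"  # protein-containing complex, as agreed in this thread


def gpi2_complex_line(complex_id, name, taxon_id, member_gene_ids,
                      db_prefix="ComplexPortal", member_prefix="PomBase"):
    """Build one tab-separated GPI 2.0 line describing a protein complex."""
    members = "|".join(f"{member_prefix}:{gene}" for gene in member_gene_ids)
    columns = [
        f"{db_prefix}:{complex_id}",  # 1  DB_Object_ID
        complex_id,                   # 2  DB_Object_Symbol (assumed: reuse the ID)
        name,                         # 3  DB_Object_Name
        "",                           # 4  DB_Object_Synonyms
        COMPLEX_TYPE,                 # 5  DB_Object_Type
        f"NCBITaxon:{taxon_id}",      # 6  DB_Object_Taxon
        "",                           # 7  Encoded_By
        "",                           # 8  Parent_Protein
        members,                      # 9  Protein_Containing_Complex_Members
        "",                           # 10 DB_Xrefs
        "",                           # 11 Gene_Product_Properties
    ]
    return "\t".join(columns)


# Placeholder values only; the complex, gene, and taxon IDs are not real.
line = gpi2_complex_line(
    "CPX-0000", "example complex", 0, ["SPAC0000.01", "SPBC0000.02"]
)
fields = line.split("\t")
```

Switching column 9 to UniProt IDs later would only mean passing different member IDs and a different `member_prefix`.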

kimrutherford · 2024-09-10T00:17:35Z

> The example in the GPI docs shows UniProt IDs but the spec implies that any ID will do. I've used PomBase gene IDs for now. I'll change it if there's a problem.

We should check this with the Noctua people. Perhaps we need to use UniProt IDs in field 9 ("Protein_Containing_Complex_Members") of the GPI file?

kimrutherford · 2024-09-10T00:29:46Z

> We should check this with the Noctua people.

I've commented here:

Question: can't locate complexes specified in our GPI geneontology/noctua#910 (comment)

ValWood assigned kimrutherford Apr 27, 2024

ValWood added the high priority label Apr 27, 2024

ValWood changed the title ~~Load complex portal UIDs into Cahdo and put them in the GPAD file~~ Load complex portal UIDs into Chado and put them in the GPAD file Apr 30, 2024

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue May 1, 2024

Load Complex Portal ID mapping

d606403

Refs: pombase/pombase-chado#1166

kimrutherford added a commit that referenced this issue May 1, 2024

Add option to create feature in GenericFeaturePub

aea53a3

Refs #1166

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue May 1, 2024

Write protein complex lines to GPI file

0dddae4

Refs pombase/pombase-chado#1166

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue May 1, 2024

Fix failing test

6c6fc97

Refs pombase/pombase-chado#1166

kimrutherford added a commit to pombase/website that referenced this issue May 2, 2024

Include ComplexPortal when reading db-xref.yaml

6a4ccfc

Refs pombase/pombase-chado#1166

kimrutherford added a commit that referenced this issue May 3, 2024

Add script to parse the Complex Portal data file

1ae2cbe

Refs #1166

kimrutherford added a commit that referenced this issue May 3, 2024

Add a new load option: generic-feature-names

fdcbfde

Sets feature names from a file of feature uniquenames and names. Refs #1166

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue May 3, 2024

Download and load Complex Portal data file

8157d04

Loads complex portal IDs and names and the mapping to genes. Genes are part_of complexes via feature_relationship. Refs pombase/pombase-chado#1166

kimrutherford added the needs testing label May 3, 2024

kimrutherford added a commit that referenced this issue May 3, 2024

Fix missing feature_relationship loading complexes

ec6cdaf

We were only storing one feature_relationship per complex when loading pombe_to_complex_id_mapping.tsv Refs #1166

kimrutherford added a commit to pombase/pombase-chado-json that referenced this issue May 6, 2024

Include column 9 when writing complexes to GPI

c79960e

Column 9: "Protein_Containing_Complex_Members" Refs pombase/pombase-chado#1166

kimrutherford closed this as completed May 6, 2024
