Combining GO with BioGRID data
6.5 years ago

I've downloaded the human PPIs and ID mappings from BioGRID and the gene ontology annotations from geneontology.org, and I'd like to combine them into a single dataset. In their respective files, PPIs are identified by the Entrez Gene ID of each interactor, while GO annotations are associated with UniProtKB IDs. However, in the BioGRID mappings file, some Entrez Gene IDs are mapped to multiple UniProtKB IDs. Why does this happen? Given this, what would be the most correct way to combine the two databases?

My background is in computer engineering, not biology or anything related, so I'm pretty confused. Help would be much appreciated :)

Thanks!

biogrid gene ontology go gene mappings • 1.4k views

What do you mean by combining? Do you need to put everything into one big file? Depending on what you're actually trying to do, this may not be necessary. A protein-coding gene can give rise to several different proteins, so a single gene identifier can be linked to multiple protein identifiers. It depends on the task or question at hand, but data integration is very often performed at the gene level, so in your case you would associate each gene ID with all GO terms associated with any of that gene's UniProt IDs. Also, if the species you're working with are present in Ensembl, you could retrieve the gene annotations directly from there using BioMart or the Perl API.
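The gene-level integration described above can be sketched in Python. This is only a minimal illustration under stated assumptions: the function name `go_terms_per_gene` is made up for this example, the inputs are assumed to be already parsed into simple (gene ID, protein ID) and (protein ID, GO ID) pairs, and the IDs in the toy data are placeholders rather than real Entrez, UniProtKB, or GO identifiers.

```python
from collections import defaultdict


def go_terms_per_gene(gene_protein_pairs, protein_go_pairs):
    """Give each gene the union of GO terms over all of its protein IDs.

    Both arguments are iterables of 2-tuples; how they are parsed from the
    BioGRID and GO files is left out and would depend on those file formats.
    """
    # Index the annotations by protein ID first.
    protein_go = defaultdict(set)
    for protein_id, go_id in protein_go_pairs:
        protein_go[protein_id].add(go_id)

    # Then pool the GO terms of every protein mapped to the same gene.
    gene_go = defaultdict(set)
    for gene_id, protein_id in gene_protein_pairs:
        gene_go[gene_id] |= protein_go.get(protein_id, set())
    return dict(gene_go)


# Toy data with placeholder IDs (hypothetical, for illustration only).
toy_mapping = [("1001", "PROT_A"), ("1001", "PROT_B"), ("1002", "PROT_C")]
toy_annotations = [
    ("PROT_A", "GO:0000001"),
    ("PROT_B", "GO:0000002"),
    ("PROT_B", "GO:0000001"),
    ("PROT_D", "GO:0000003"),  # protein with no gene in the mapping
]

gene_go = go_terms_per_gene(toy_mapping, toy_annotations)
for gene_id in sorted(gene_go):
    print(gene_id, sorted(gene_go[gene_id]))
```

Genes whose proteins have no annotation end up with an empty set rather than being dropped, which makes it easy to see how many genes lack GO terms.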


Yeah, I'd like to create a dataset where each entry is a gene and has PPI information as well as the biological processes associated with it. Just to make sure I understood correctly: Entrez Gene IDs identify genes, while UniProtKB IDs identify proteins? And why does the BioGRID mappings file list ~47k unique Entrez Gene IDs for humans, considering humans only have around 20,000 genes?

I'll look into the alternatives you suggested. Thanks a lot!! :)
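A per-gene dataset of the kind described in this post (interaction partners plus GO terms for each gene) might be laid out roughly as follows. This is a sketch, not a definitive design: `build_gene_records` and the record layout are hypothetical, the interactions are assumed to be already parsed into pairs of gene IDs, and the IDs in the toy data are placeholders.

```python
from collections import defaultdict


def build_gene_records(interactions, gene_go):
    """Build one record per gene with its PPI partners and its GO terms.

    interactions: iterable of (gene_a, gene_b) pairs, treated as undirected.
    gene_go: dict mapping a gene ID to a set of GO IDs.
    """
    # Record each interaction in both directions.
    partners = defaultdict(set)
    for gene_a, gene_b in interactions:
        partners[gene_a].add(gene_b)
        partners[gene_b].add(gene_a)

    # Include genes that appear in either source, even if one side is empty.
    records = {}
    for gene_id in sorted(set(partners) | set(gene_go)):
        records[gene_id] = {
            "partners": sorted(partners.get(gene_id, set())),
            "go_terms": sorted(gene_go.get(gene_id, set())),
        }
    return records


# Toy data with placeholder IDs (hypothetical, for illustration only).
toy_interactions = [("1001", "1002"), ("1002", "1003")]
toy_gene_go = {"1001": {"GO:0000001"}, "1004": {"GO:0000002"}}

records = build_gene_records(toy_interactions, toy_gene_go)
for gene_id, record in records.items():
    print(gene_id, record)
```

Keeping genes that appear in only one of the two sources is a design choice; filtering to genes present in both would be a one-line change if the analysis requires it.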


Yes, Entrez Gene IDs identify genes and UniProt IDs identify proteins. There are about 20,000 protein-coding genes in the human genome. The other genes (about 22k annotated in Ensembl v90) are non-coding, i.e. they do not produce proteins. I don't know where you got the figure of 47k human genes in BioGRID; their statistics page mentions 21.7k unique human genes.
