Question: Combining GO with BioGRID data
2.1 years ago by
denilsonjnr60 wrote:

I've downloaded the human PPIs and ID mappings from BioGRID and gene ontologies from, and I'd like to combine them into a single dataset. In their respective files, PPIs are identified by the Entrez Gene ID of each interactor, while ontology terms are associated with UniProtKB IDs. However, in the BioGRID mappings file, some Entrez Gene IDs are mapped to multiple UniProtKB IDs. Why does this happen? Given this, what would be the most correct way to combine the two databases?

My background is in computer engineering, not biology or anything related, so I'm pretty confused. Help would be much appreciated :)



What do you mean by combining? Do you need to put everything into one big file? Depending on what you're actually trying to do, this may not be necessary. A protein-coding gene can give rise to several different proteins (for example through alternative splicing), hence a single gene identifier can be linked to multiple protein identifiers. It may depend on the task or question at hand, but data integration is very often performed at the gene level, so in your case you would associate the gene ID with all GO terms associated with any of the gene's UniProt IDs. Also, if the species you're working with is present in Ensembl, you could retrieve the gene annotations directly from there using BioMart or the Perl API.
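To make the gene-level approach concrete, here is a minimal Python sketch. It assumes the three files have already been parsed into plain dictionaries and lists; all identifiers below are arbitrary placeholders rather than real annotations, and the function names are made up for illustration. Each gene gets the union of the GO terms of all its UniProt IDs, plus its set of interaction partners.

```python
from collections import defaultdict

# Placeholder data (NOT real annotations); parse your own files into this shape.
# Entrez Gene ID -> UniProtKB IDs, as found in an ID-mapping file.
gene_to_uniprot = {
    "1001": ["P_A1", "P_A2"],  # one gene mapped to two proteins
    "1002": ["P_B1"],
}
# UniProtKB ID -> GO term IDs, as found in a GO annotation file.
uniprot_to_go = {
    "P_A1": {"GO:0000001", "GO:0000002"},
    "P_A2": {"GO:0000002", "GO:0000003"},
    "P_B1": {"GO:0000004"},
}
# PPIs as pairs of Entrez Gene IDs.
interactions = [("1001", "1002")]


def go_terms_by_gene(gene_to_uniprot, uniprot_to_go):
    """Give each gene the union of GO terms over all of its UniProt IDs."""
    result = {}
    for gene, accessions in gene_to_uniprot.items():
        terms = set()
        for acc in accessions:
            # Proteins without annotations contribute nothing.
            terms |= uniprot_to_go.get(acc, set())
        result[gene] = terms
    return result


def partners_by_gene(interactions):
    """Build an undirected gene -> interaction partners map."""
    partners = defaultdict(set)
    for a, b in interactions:
        partners[a].add(b)
        partners[b].add(a)
    return dict(partners)


gene_go = go_terms_by_gene(gene_to_uniprot, uniprot_to_go)
gene_partners = partners_by_gene(interactions)

# One entry per gene, holding both kinds of information.
dataset = {
    gene: {
        "go_terms": gene_go.get(gene, set()),
        "partners": gene_partners.get(gene, set()),
    }
    for gene in set(gene_go) | set(gene_partners)
}
```

Keeping the two lookups separate means a gene that has interactions but no GO annotations (or the reverse) still ends up in the dataset with an empty set, rather than being dropped silently.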

written 2.1 years ago by Jean-Karim Heriche

Yeah, I'd like to create a dataset where each entry is a gene and has PPI information as well as the biological processes associated with it. Just to make sure I understood correctly: Entrez Gene IDs identify genes, while UniProtKB IDs identify proteins? Why does the BioGRID mappings file list ~47k unique Entrez genes for humans, considering humans only have around 20,000?

I'll look into the alternatives you suggested. Thanks a lot!! :)

written 2.1 years ago by denilsonjnr6

Yes, Entrez Gene IDs are for genes and UniProt IDs are for proteins. There are about 20,000 protein-coding genes in the human genome. The other genes (about 22k annotated in Ensembl v90) are non-coding, i.e. they do not produce proteins. I don't know where you got that BioGRID has 47k human genes; their statistics page mentions 21.7k unique human genes.
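One way to look into a count mismatch like this is to compare the set of genes listed in the mappings file with the set of genes that actually appear in an interaction. The Python sketch below does not explain the 47k figure; it only shows the comparison. It uses made-up placeholder rows, and it assumes the mappings file has been reduced to (Entrez Gene ID, UniProtKB ID) pairs for human only.

```python
# Placeholder rows (NOT real data): (entrez_gene_id, uniprot_id) pairs
# taken from a mappings file after filtering to one organism.
mapping_rows = [
    ("1001", "P_A1"),
    ("1001", "P_A2"),  # the same gene listed twice, once per protein
    ("1002", "P_B1"),
    ("1003", "P_C1"),  # mapped, but never seen in an interaction below
]
# PPIs as pairs of Entrez Gene IDs.
interactions = [("1001", "1002")]


def unique_mapped_genes(rows):
    """Distinct gene IDs in the mapping, ignoring how many proteins each has."""
    return {gene for gene, _ in rows}


def genes_with_interactions(pairs):
    """Distinct gene IDs that occur on either side of any interaction."""
    return {gene for pair in pairs for gene in pair}


mapped = unique_mapped_genes(mapping_rows)
interacting = genes_with_interactions(interactions)
only_in_mapping = mapped - interacting

print(len(mapped), len(interacting), sorted(only_in_mapping))
```

Counting rows instead of distinct gene IDs, or forgetting to filter by organism and identifier type, are both easy ways to end up with an inflated number, so it is worth checking those first.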

written 2.1 years ago by Jean-Karim Heriche