Question: Get protein domain information from gene name in R
3.5 years ago by
asd10
asd10 wrote:

Given a gene name, I would like to get the name, start, and end of each protein domain in R. A web API is also acceptable.

My goal is to plot DNA mutations at the protein domain level, like the cBioPortal MutationMapper, but I would like to do it programmatically in R. I know that this information is available in the Pfam database, but I don't know how to retrieve it.

I have read the previous posts on similar topics, but I didn't find a solution. Thank you for your help!

package protein R • 2.5k views
ADD COMMENTlink modified 2.7 years ago by Biostar ♦♦ 20 • written 3.5 years ago by asd10
3.5 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche23k wrote:

You can do this using Ensembl. Use either the BioMart interface or the Perl API.

EDIT: Forgot the R bit: there's the biomaRt Bioconductor package.
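
As a rough illustration of the biomaRt route, here is a minimal sketch that looks up Pfam domain coordinates for one gene symbol. The function name `get_pfam_domains` is made up for this example, and the attribute names (`pfam`, `pfam_start`, `pfam_end`) and the `hgnc_symbol` filter are assumptions about the current Ensembl release; `biomaRt::listAttributes()` and `biomaRt::listFilters()` show what a given mart actually offers.

```r
# Sketch: fetch Pfam domain coordinates for one gene symbol via Ensembl BioMart.
# Assumes the Bioconductor package biomaRt is installed and that the attribute
# and filter names below exist in the Ensembl release being queried.
get_pfam_domains <- function(gene_symbol,
                             dataset = "hsapiens_gene_ensembl") {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("Install biomaRt first: BiocManager::install('biomaRt')")
  }
  # Connect to the Ensembl genes mart for the chosen species dataset.
  mart <- biomaRt::useEnsembl(biomart = "genes", dataset = dataset)
  domains <- biomaRt::getBM(
    attributes = c("hgnc_symbol", "ensembl_transcript_id",
                   "pfam", "pfam_start", "pfam_end"),
    filters = "hgnc_symbol",
    values = gene_symbol,
    mart = mart
  )
  # Drop rows for transcripts that have no Pfam hit (empty or NA domain ID).
  domains <- domains[!is.na(domains$pfam) & domains$pfam != "", ]
  # Sort by transcript, then by domain start position on the protein.
  domains[order(domains$ensembl_transcript_id, domains$pfam_start), ]
}
```

Calling `get_pfam_domains("TP53")` needs network access to the Ensembl servers. Keeping the transcript ID in the result makes it visible which protein isoform each domain row belongs to.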

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Jean-Karim Heriche23k

Thank you. BioMart returns the required results, but it returns too many rows, not just those annotated as 'Pfam' and 'low_complexity' on the Pfam website.

How can I add the corresponding source and domain columns to the BioMart results?

ADD REPLYlink written 3.5 years ago by asd10

The HTML results view of Ensembl BioMart looks buggy: results are returned per transcript, even if you haven't selected the transcript IDs to be returned and even if you request unique results only. However, exporting unique results as a TSV file seems to work as expected.

ADD REPLYlink written 3.5 years ago by Jean-Karim Heriche23k

For TP53, the BioMart unique TSV contains 17 rows, but the Pfam website lists just 13. BioMart has a domain from 1 to 156, whereas Pfam has one from 1 to 23.

Why is there a difference?

ADD REPLYlink written 3.5 years ago by asd10

It looks like the unique results in the TSV file still contain results corresponding to different transcripts, and so likely to slightly different proteins. Since you want to locate mutations relative to protein domains, you should consider all proteins produced by a given gene anyway. Note that Pfam has no notion of genes or of the underlying genome: it just annotates proteins from UniProt, usually only the canonical sequence and not the variants, whereas Ensembl annotates all proteins.
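
To see how rows from different transcripts get mixed together, the exported table can be split per transcript before comparing it with Pfam. The sketch below assumes a data frame with the columns `ensembl_transcript_id`, `pfam`, `pfam_start`, and `pfam_end`; the helper name `split_domains_by_transcript` and all IDs and coordinates in the toy input are invented placeholders, not real annotations.

```r
# Toy input with the column layout a BioMart export might have.
# The IDs and coordinates are made-up placeholders, not real annotations.
domains <- data.frame(
  ensembl_transcript_id = c("TX_A", "TX_A", "TX_B", "TX_B", "TX_B"),
  pfam       = c("PF_X", "PF_Y", "PF_X", "PF_Y", "PF_Y"),
  pfam_start = c(5L, 100L, 5L, 61L, 61L),
  pfam_end   = c(30L, 290L, 30L, 251L, 251L),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows, then group domains by the transcript
# (and hence the protein isoform) they were annotated on.
split_domains_by_transcript <- function(domains) {
  domains <- unique(domains)
  domains <- domains[order(domains$ensembl_transcript_id, domains$pfam_start), ]
  split(domains[, c("pfam", "pfam_start", "pfam_end")],
        domains$ensembl_transcript_id)
}

per_transcript <- split_domains_by_transcript(domains)

# Number of distinct domain rows per transcript.
sapply(per_transcript, nrow)
```

Each element of `per_transcript` then holds the domains of a single isoform, so a mutation can be placed against the coordinates of the protein it was actually called on rather than against a merged table.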

ADD REPLYlink written 3.5 years ago by Jean-Karim Heriche23k