Question: Exon to Protein
allphipsi (Germany) wrote, 4.0 years ago:

I am close to a solution but thought I would still ask here. I have the start and end positions of a set of exons (the list has >200 separate coordinates). All I want is to retrieve the protein sequence encoded by each exon's sequence alone. Later, I would also like to map the start and end coordinates of each "exon-encoded peptide" onto the protein.

To make it clearer, here is one example. The genomic coordinates of the exon are:

10:13803647:13803669:-1 or chr10:13803647-13803669-

This sequence encodes the protein/peptide sequence ==> EAKGDFSS

To solve this problem, I have tried BioMart (Ensembl) extensively, but it gives me the sequence of the full protein, from which I can't extract the region for the exon I am interested in. I am using the same assembly throughout, namely hg19/GRCh37.

I also tried the UCSC Table Browser, although it couldn't map all the regions. Why is that? Is there any difference between the Ensembl and UCSC coordinate systems? I am aware that some regions might not be coding, so there might be no output for them, but I am not sure, as I have only recently started this kind of work.
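On the coordinate question: Ensembl reports 1-based, fully closed intervals, whereas UCSC tables and BED files store a 0-based start with an exclusive end, so mixing the two shifts every start by one base. A minimal sketch of the conversion follows; the function names are illustrative and not part of any library, and special cases such as the mitochondrial chromosome name (MT vs chrM) are not handled.

```python
def parse_ensembl_region(region):
    """Parse 'chrom:start:end:strand' (1-based, inclusive coordinates)."""
    chrom, start, end, strand = region.split(":")
    return chrom, int(start), int(end), int(strand)


def ensembl_to_bed(region):
    """Convert an Ensembl-style region to a BED-style tuple (0-based, half-open)."""
    chrom, start, end, strand = parse_ensembl_region(region)
    # Only the start moves: 1-based inclusive -> 0-based inclusive.
    # The end is unchanged because inclusive 1-based == exclusive 0-based.
    return "chr" + chrom, start - 1, end, "+" if strand > 0 else "-"


def bed_to_ensembl(chrom, start, end, strand):
    """Convert a BED-style interval back to an Ensembl-style region string."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return f"{name}:{start + 1}:{end}:{1 if strand == '+' else -1}"


bed = ensembl_to_bed("10:13803647:13803669:-1")
exon_length = bed[2] - bed[1]  # half-open intervals give the length directly
```

With the example exon above, the BED-style start is one less than the Ensembl start, while the end and the exon length are unchanged.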

(NOTE: As a given exon may be part of several transcripts, I might get more than one peptide sequence from the same exonic region because of a change in frame.)

I also tried another quick solution: translating the exon sequences in three frames, since I know the strand (+/-). However, this did not give a good result for all the sequences (NOT all of them translate over their full length; some are terminated by an in-frame stop codon!). Cases that translated over their full length were mapped onto the protein sequences I got by mapping the exon coordinates in BioMart. This does not look like a foolproof solution, though, as I am not able to cover all the sequences.
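The three-frame approach described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the input is the forward-strand genomic sequence of the exon, the standard genetic code applies, and all function names are made up for this example. Codons that are split across exon boundaries are dropped, and a frame is discarded as soon as it contains a stop codon, which is exactly why a terminal coding exon or a non-coding exon yields nothing useful.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def three_frame_translations(genomic_seq, strand):
    """Translate an exon in the three frames of its own strand (strand is 1 or -1)."""
    coding = reverse_complement(genomic_seq) if strand < 0 else genomic_seq
    return {offset: translate(coding[offset:]) for offset in range(3)}


def open_frames(genomic_seq, strand):
    """Keep only the frames without an in-frame stop codon."""
    frames = three_frame_translations(genomic_seq, strand)
    return {offset: pep for offset, pep in frames.items() if "*" not in pep}
```

Several frames can stay open for a short exon, so this step alone cannot say which one is actually used; that information has to come from the transcript's CDS annotation.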

What else can I use here? Do I need to change the strategy? Any insights will be appreciated!

rna-seq protein exon R genome • 2.1k views

" I have the start and end position of exons with me (List has >200 separate coordinate information). All I want is to retrieve the protein sequence which is encoded just by the mentioned exon sequence."

That is not enough: you need to know where translation starts and ends.

written 4.0 years ago by Pierre Lindenbaum

Thanks for the suggestion, Pierre. I will first try to get the CDS start and CDS end information. If I have this together with the phase information, I can do an accurate conceptual translation of any CDS that includes my exon of interest. I will use the UCSC Table Browser, as it provides CDS information along with the phase for every exon where available. What should I do in cases where I don't have CDS information for an exon? Have I understood this correctly?

written 4.0 years ago by allphipsi

If there's no CDS or frame then you need to skip the exon. Not all exons are coding, after all.
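A minimal sketch of that rule, assuming each exon comes with its coding-strand sequence (already trimmed to the CDS) and a frame value where -1 means "no frame" and 0, 1 or 2 is the number of bases of the current codon that lie before the exon. That definition is an assumption about the input; GFF/GTF "phase" instead gives the number of bases to skip, so check which convention your table uses. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_coding_exon(coding_seq, frame):
    """Translate the complete codons of one exon, or return None if it has no frame.

    coding_seq: exon sequence on the coding strand, restricted to the CDS.
    frame: -1 for a non-coding exon, otherwise how many bases of the codon
           in progress were already supplied by the upstream exon (assumed convention).
    """
    if frame < 0:
        return None  # non-coding exon: skip it
    skip = (3 - frame) % 3  # bases that finish the codon started upstream
    seq = coding_seq[skip:].upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


# Hypothetical exons: (coding-strand sequence, frame)
exons = [("ATGGCC", 0), ("GATGGCC", 2), ("GGATGGCC", 1), ("ATGGCC", -1)]
peptides = [translate_coding_exon(seq, frame) for seq, frame in exons]
```

Residues whose codon is split between two exons are not returned here; recovering them requires the neighbouring exon's sequence.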

written 4.0 years ago by Devon Ryan
Jean-Karim Heriche (EMBL Heidelberg, Germany) wrote, 4.0 years ago:

The Ensembl Perl API has an Exon object with a peptide method. So get a slice using the coordinates you have, then fetch your exon with ExonAdaptor::fetch_all_by_Slice(), and get the peptide simply with $pept = $exon->peptide($transcript)->seq;

Note that you need a transcript, as the translation of the exon can vary between transcripts.
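The dependence on the transcript comes down to simple arithmetic: the amount of coding sequence upstream of the exon sets both its reading frame and where its peptide falls in the protein. The sketch below is plain Python, not the Ensembl API; the function names and the exon lengths are hypothetical.

```python
def exon_frame(cds_lengths, index):
    """Number of bases of the current codon already used before exon `index`.

    cds_lengths: coding length (bp) of each exon in transcription order,
                 with 0 for exons that are entirely non-coding.
    """
    return sum(cds_lengths[:index]) % 3


def exon_protein_span(cds_lengths, index):
    """Return (first_residue, last_residue), 1-based inclusive, for one exon.

    Every residue whose codon overlaps the exon is counted, including codons
    split with a neighbouring exon. Returns None for a non-coding exon.
    """
    length = cds_lengths[index]
    if length == 0:
        return None
    cds_start = sum(cds_lengths[:index])  # 0-based offset of the exon's first coding base
    cds_end = cds_start + length - 1      # 0-based offset of its last coding base
    return cds_start // 3 + 1, cds_end // 3 + 1


# Same 23 bp exon in two hypothetical transcripts with different upstream coding lengths.
transcript_a = [10, 23, 0, 12]
transcript_b = [9, 23, 0, 12]
span_a = exon_protein_span(transcript_a, 1)
span_b = exon_protein_span(transcript_b, 1)
```

In the first transcript the exon starts inside a codon and touches eight residues; in the second it starts on a codon boundary and covers a different stretch of the protein. If the CDS lengths include the stop codon, the last position corresponds to the stop rather than to a residue.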

 

 
