Question

Exon to Protein

9.9 years ago

allphipsi ▴ 10

I think I am close to a solution, but I thought I would still ask here. I have the start and end positions of a set of exons (the list has >200 separate coordinates). All I want is to retrieve the protein sequence encoded by just the given exon sequence. Later, I would also like to map the start and end coordinates of each "exon-encoded peptide" onto the protein.

To make it clearer, here is one example. The genomic coordinates of the exon are:

10:13803647:13803669:-1 or chr10:13803647-13803669-

This exon sequence encodes the protein/peptide sequence ==> EAKGDFSS
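Both notations above describe the same region, and they can be normalised with a small parser. The following Python is a minimal sketch; `Region` and `parse_region` are illustrative names, not part of any library. It also makes one detail explicit: the example spans 23 bp, which is not a multiple of three, so at least one codon is split across an exon boundary and the reading frame has to be known.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str   # chromosome name without a "chr" prefix
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: int  # 1 or -1


def parse_region(text: str) -> Region:
    """Parse '10:13803647:13803669:-1' or 'chr10:13803647-13803669-'."""
    text = text.strip()
    # Colon-separated form with a numeric strand.
    m = re.fullmatch(r"(?:chr)?(\w+):(\d+):(\d+):([+-]?1)", text)
    if m:
        return Region(m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))
    # "chrN:start-end" form with a trailing + or - strand.
    m = re.fullmatch(r"(?:chr)?(\w+):(\d+)-(\d+)([+-])", text)
    if m:
        strand = 1 if m.group(4) == "+" else -1
        return Region(m.group(1), int(m.group(2)), int(m.group(3)), strand)
    raise ValueError(f"unrecognised region: {text!r}")


def region_length(region: Region) -> int:
    """Length in bp of a 1-based inclusive interval."""
    return region.end - region.start + 1
```

Parsing everything into one representation up front makes it easier to compare results from different resources later.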

To solve this problem, I have tried BioMart (Ensembl) extensively, but it gives me the sequence of the full protein, from which I can't extract the region encoded by the exon I am interested in. I am using the same assembly throughout, namely hg19/GRCh37.

I should mention that I also tried the UCSC Table Browser, but it couldn't map all the regions. Why is that? Is there a difference between the Ensembl and UCSC coordinate systems? I am aware that some regions might be non-coding, so there may be no output for them, but I am not sure, as I have only recently started this kind of work.
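On the coordinate question: Ensembl reports 1-based, inclusive coordinates with chromosome names like `10`, while UCSC BED files and the underlying UCSC tables store 0-based, half-open intervals with names like `chr10`. Forgetting to shift the start by one is a common cause of failed lookups. The sketch below shows the conversion; the function names are illustrative, and it assumes the simple `chr` prefix rule, which does not hold for every sequence name (the mitochondrial chromosome, for example, is named differently in the two resources).

```python
def ensembl_to_bed(chrom, start, end, strand):
    """Convert 1-based inclusive coordinates to 0-based half-open BED fields."""
    bed_chrom = chrom if chrom.startswith("chr") else "chr" + chrom
    # Only the start shifts: [start, end] inclusive becomes [start - 1, end).
    return (bed_chrom, start - 1, end, "+" if strand == 1 else "-")


def bed_to_ensembl(chrom, start, end, strand):
    """Convert 0-based half-open BED fields to 1-based inclusive coordinates."""
    ens_chrom = chrom[3:] if chrom.startswith("chr") else chrom
    return (ens_chrom, start + 1, end, 1 if strand == "+" else -1)
```

The interval length is preserved by the conversion: `end - start` in BED terms equals `end - start + 1` in 1-based terms.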

(NOTE: Since a given exon may be part of different transcripts, I might get more than one peptide sequence from the same exonic region because of a change in frame.)

I also tried another quick solution: translating the exon sequences in three frames, since I know the +/- strand. However, this did not give a good solution for all the sequences (NOT all of them code for a peptide over their full length; some are terminated by an in-frame stop codon!). The cases that were fully coding were mapped onto the protein sequences I got by mapping the exon coordinates in BioMart. This does not look like a foolproof solution, though, as I am not able to cover all the sequences.
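The three-frame approach described above can be sketched roughly as follows in plain Python. This is not the exact procedure that was run; the function names are illustrative, the standard genetic code is assumed, and the input is assumed to be the forward-strand genomic sequence of the exon (pass `strand=1` if the sequence has already been reverse-complemented).

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def three_frame(genomic_seq, strand):
    """Translate the coding strand of an exon in frames 0, 1 and 2."""
    coding = reverse_complement(genomic_seq) if strand == -1 else genomic_seq.upper()
    return {frame: translate(coding[frame:]) for frame in range(3)}


def open_frames(frames):
    """Frames whose translation contains no stop codon."""
    return [frame for frame, pep in frames.items() if "*" not in pep]
```

A frame without a stop codon is only a candidate: a short exon can be open in more than one frame, and a non-coding exon can be open by chance, which is why this approach cannot resolve every case on its own.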

What else can I use here? Do I need to change the strategy? Any insights will be appreciated!

protein genome RNA-Seq R exon • 3.9k views

updated 2.7 years ago by Ram 45k • written 9.9 years ago by allphipsi ▴ 10

I have the start and end positions of a set of exons (the list has >200 separate coordinates). All I want is to retrieve the protein sequence encoded by just the given exon sequence.

That is not enough; you need to know where the translation starts and ends.

updated 2.7 years ago by Ram 45k • written 9.9 years ago by Pierre Lindenbaum 166k

Thanks for the suggestion, Pierre. I will try getting the CDS start and CDS end information first. If I have this together with the phase information, then I can do an accurate conceptual translation of any CDS related to my exon of interest. I will use the UCSC Table Browser, as it provides CDS information along with the phase for every exon wherever available. What should I do with cases where I don't have CDS information for an exon? Have I understood this correctly?
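A phase-aware translation along these lines might look like the sketch below. It assumes the GFF-style meaning of phase (the number of bases to skip before the first complete codon) and that the input is the coding-strand sequence of the CDS portion of the exon. The UCSC `exonFrames` column uses a different convention (the codon position of the exon's first coding base, or -1 for no frame), so `ucsc_frame_to_phase` is included as a conversion; treat that mapping as an assumption to verify against the UCSC documentation. All function names are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def ucsc_frame_to_phase(frame):
    """Assumed mapping from a UCSC exonFrames value to a GFF-style phase."""
    if frame not in (0, 1, 2):
        return None  # -1 (or anything else) means the exon has no frame
    return (3 - frame) % 3


def translate_cds_segment(coding_seq, phase):
    """Translate the complete codons of one CDS segment.

    coding_seq is on the coding strand; phase is the number of leading
    bases that belong to a codon started in the previous exon.
    Returns None when no usable phase is available.
    """
    if phase not in (0, 1, 2):
        return None
    body = coding_seq.upper()[phase:]
    usable = len(body) - len(body) % 3  # drop a trailing partial codon
    return "".join(CODON_TABLE.get(body[i:i + 3], "X") for i in range(0, usable, 3))
```

Codons split across an exon boundary are dropped here, so the peptide from this function can be one or two residues shorter than the stretch of protein that the exon contributes to.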

updated 2.7 years ago by Ram 45k • written 9.9 years ago by allphipsi ▴ 10

If there's no CDS or frame then you need to skip the exon. Not all exons are coding, after all.
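As a rough illustration of that filter, the sketch below clips an exon to the transcript's CDS interval and returns `None` when nothing coding remains. It assumes all coordinates have already been converted to 1-based inclusive, and that a non-coding transcript shows up as a missing or empty CDS interval; `coding_part` is an illustrative name and the numbers in the usage lines are made up.

```python
def coding_part(exon_start, exon_end, cds_start, cds_end):
    """Return the (start, end) overlap of an exon with the CDS, or None.

    All coordinates are 1-based inclusive. None is returned for exons
    that are entirely UTR or belong to a transcript without a CDS.
    """
    if cds_start is None or cds_end is None or cds_start > cds_end:
        return None  # no CDS annotated for this transcript
    start = max(exon_start, cds_start)
    end = min(exon_end, cds_end)
    return (start, end) if start <= end else None


# Hypothetical exons against a hypothetical CDS spanning 150..900.
exons = [(100, 200), (300, 400), (950, 1000)]
coding = [coding_part(s, e, 150, 900) for s, e in exons]
```

Exons for which this returns `None` are the ones to skip; partially coding exons keep only the clipped interval.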

updated 2.7 years ago by Ram 45k • written 9.9 years ago by Devon Ryan 105k

Ram · Answer 1 · 2015-08-10

9.9 years ago

Jean-Karim Heriche 27k

The Ensembl Perl API has an Exon object with a peptide method. So get a slice using the coordinates you have, then fetch your exon with ExonAdaptor::fetch_all_by_Slice(), and get the peptide simply with $pept = $exon->peptide($transcript)->seq;

Note that you need a transcript as the translation of the exon can vary between transcripts.
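To illustrate why the transcript matters, and how the position of an exon's peptide within the protein can be worked out, here is a small Python sketch. It is not the Ensembl API; `exon_residue_ranges` is an illustrative helper that takes the coding length of each exon of one transcript, in transcript order, and returns the 1-based residue range that each exon contributes to. The lengths in the test are hypothetical.

```python
def exon_residue_ranges(cds_lengths):
    """Map each exon of one transcript to a 1-based residue range.

    cds_lengths holds the number of coding bases in each exon, in
    transcript (5' to 3') order; 0 marks a fully non-coding exon.
    A residue whose codon is split across two exons is counted in both.
    If the CDS includes the stop codon, the final "residue" is the stop.
    """
    ranges = []
    offset = 0  # coding bases seen so far in this transcript
    for length in cds_lengths:
        if length == 0:
            ranges.append(None)
            continue
        first = offset // 3 + 1
        last = (offset + length - 1) // 3 + 1
        ranges.append((first, last))
        offset += length
    return ranges
```

Because the offset depends on all upstream coding exons, the same genomic exon can map to different residue ranges, and different frames, in different transcripts.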

updated 2.7 years ago by Ram 45k • written 9.9 years ago by Jean-Karim Heriche 27k