I think I am close to a solution but wanted to ask here anyway. I have the start and end positions of a set of exons (the list contains more than 200 separate coordinates). All I want is to retrieve the protein sequence encoded by each exon on its own. Later, I would also like to map the start and end coordinates of each "exon-encoded peptide" onto the full protein.
To make this clearer, here is one example. The genomic coordinates of the exon are
10:13803647:13803669:-1 or chr10:13803647-13803669-
This exon sequence encodes the protein/peptide sequence EAKGDFSS.
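The two notations in the example can be read into the same fields with a short Python sketch. It assumes both forms use 1-based, inclusive coordinates (the Ensembl convention); the names `ExonRegion`, `parse_region` and `region_length` are illustrative only and do not come from any library.

```python
import re
from typing import NamedTuple


class ExonRegion(NamedTuple):
    chrom: str
    start: int   # assumed 1-based, inclusive
    end: int     # assumed 1-based, inclusive
    strand: int  # +1 or -1


def parse_region(text: str) -> ExonRegion:
    """Parse '10:13803647:13803669:-1' or 'chr10:13803647-13803669-'."""
    text = text.strip()
    # Ensembl-style: chrom:start:end:strand
    m = re.fullmatch(r"(?:chr)?(\w+):(\d+):(\d+):([+-]?1)", text)
    if m is None:
        # UCSC-like style with a trailing strand sign: chrN:start-end(+|-)
        m = re.fullmatch(r"(?:chr)?(\w+):(\d+)-(\d+)([+-])", text)
    if m is None:
        raise ValueError(f"unrecognised region: {text!r}")
    chrom, start, end, strand = m.groups()
    return ExonRegion(chrom, int(start), int(end),
                      -1 if strand.startswith("-") else 1)


def region_length(region: ExonRegion) -> int:
    """Length in bases for inclusive coordinates."""
    return region.end - region.start + 1


region = parse_region("10:13803647:13803669:-1")
print(region, region_length(region), region_length(region) % 3)
```

For the example region the length is 23 bases, which is not a multiple of 3. If the whole exon is coding, at least one codon at its boundary would therefore have to be completed by a neighbouring exon, so the exon alone cannot be translated into whole codons.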
To solve this problem, I have tried BioMart (Ensembl) extensively, but it returns the sequence of the full protein, from which I cannot extract the region corresponding to the exon I am interested in. I am using the same assembly throughout, namely hg19/GRCh37.
I also tried the UCSC Table Browser, but it could not map all the regions. Why is that? Is there any difference between the Ensembl and UCSC coordinate systems? I am aware that some regions might be non-coding, so there might be no output for them, but I am not sure, as I have only recently started this kind of work.
(NOTE: Because a given exon might be part of several transcripts, I might get more than one peptide sequence from the same exonic region if the reading frame differs between transcripts.)
I also tried another quick solution: translating the exon sequences in three frames, since I know the strand (+/-) of each exon. However, this did not give a good result for all the sequences: not all of them translate over their full length, because some are terminated by an in-frame stop codon. The cases that did translate fully were mapped onto the protein sequences I obtained by querying BioMart with the exon coordinates. This does not look like a foolproof solution, though, as I am not able to cover all the sequences.
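A minimal sketch of this kind of three-frame translation is given here for reference. It assumes the exon sequence is supplied as it appears on the forward (+) strand and is reverse-complemented for -1 strand exons, uses the standard genetic code, and drops incomplete trailing codons. The function names are illustrative and not taken from any library.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): AMINO[i]
    for i, codon in enumerate(product(BASES, repeat=3))
}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def three_frame_translate(plus_strand_seq: str, strand: int) -> dict:
    """Translate the coding-strand sequence in frames 0, 1 and 2."""
    coding = reverse_complement(plus_strand_seq) if strand == -1 else plus_strand_seq
    return {frame: translate(coding[frame:]) for frame in range(3)}


def stop_free_frames(peptides: dict) -> list:
    """Frames whose translation contains no stop codon ('*')."""
    return [frame for frame, pep in peptides.items() if "*" not in pep]


peptides = three_frame_translate("TTAGGCCAT", strand=-1)
print(peptides, stop_free_frames(peptides))
```

A stop-free frame is only a candidate: the frame actually used depends on the transcript, so more than one frame can be plausible for the same exon.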
What else can I use here? Do I need to change my strategy? Any insights will be appreciated!