Question: convert CDS coordinates of a GTF to amino acid coordinates with respect to the resultant protein
11 days ago by
user230613280
Europe
user230613280 wrote:

Hello! I wonder if there is a way to convert the CDS coordinates of a gene to amino-acid-based coordinates. See the example:

```
five_prime_utr1 1   495
exon1   496 568
CDS1    496 568
intron1 569 698
exon2   699 968
CDS2    699 968
```

The protein for this gene will start at nucleotide position 496 (CDS1), so 496 will correspond to position 1 of the resultant protein. What would the end position (568) be? What about the next domain, the one whose CDS starts at 699 (CDS2)?

What I need is to identify the subsequence of my full protein that was encoded by each CDS, for example:

ABC DEF GHIJKL MNOPQ RST (full protein sequence, amino acid domains belonging to each CDS in bold)

------ CDS1 ----------- CDS2

------- 3-5---------------13-17

I've found similar questions such as: Python Framework For Converting Genomic To Protein Coordinates

Is there an existing solution for this problem? It seems quite common.

protein coordinates • 119 views
modified 11 days ago • written 11 days ago by user230613280

I am not sure I fully understand your question, but I think it would look like this: 5' UTR -- CDS domain 1 (496-568) -- intron sequence -- CDS domain 2 (699-968). GTF coordinates are 1-based and inclusive, so CDS domain 1 is 568 - 496 + 1 = 73 nucleotides (24 complete codons plus one extra nucleotide) and CDS domain 2 is 968 - 699 + 1 = 270 nucleotides. The total, 343, is not a multiple of 3, so the example coordinates may be off by a position or two? When this gene is processed, the intron will be clipped off and the exons (CDS domains) will be merged together. So the first CDS domain fully encodes amino acids 1 to 24 (72/3 = 24), and the 25th amino acid is encoded by the last nucleotide of the first CDS domain together with the first two nucleotides of the second.
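To make the arithmetic concrete, here is a minimal Python sketch of the coordinate conversion. It assumes GTF-style coordinates (1-based, inclusive, so a segment's length is end - start + 1), a plus-strand gene, CDS segments listed in transcript order, and a first segment that starts in frame. The function name `cds_to_protein_ranges` is made up for illustration and is not part of any library.

```python
def cds_to_protein_ranges(cds_intervals):
    """Map each CDS segment to the 1-based range of amino acids it touches.

    cds_intervals: list of (start, end) genomic coordinates, 1-based and
    inclusive, in transcript order on the plus strand (assumption).
    An amino acid whose codon straddles a splice junction is reported
    for both neighbouring segments.
    """
    ranges = []
    offset = 0  # coding nucleotides that precede the current segment
    for start, end in cds_intervals:
        length = end - start + 1
        first_aa = offset // 3 + 1                 # codon holding the first base
        last_aa = (offset + length - 1) // 3 + 1   # codon holding the last base
        ranges.append((first_aa, last_aa))
        offset += length
    return ranges


# Coordinates from the question: CDS1 = 496-568, CDS2 = 699-968
print(cds_to_protein_ranges([(496, 568), (699, 968)]))
# [(1, 25), (25, 115)]
```

Amino acid 25 appears in both ranges because its codon is split across the junction. With these example coordinates the total coding length is 343 nucleotides, so codon 115 is incomplete and would not yield an amino acid in practice. For a minus-strand gene the segments would have to be ordered from the highest coordinate to the lowest before calling the function.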

Hi! Sorry if I didn't explain it clearly. In a nutshell, what I need is "simply" to extract the amino acid sequences corresponding to exon 2 and exon 3 of a given gene. What I have is a FASTA file with the gene sequence, another FASTA file with the protein sequence, and a GTF with the annotation of the gene.


Ah, so you already have a gene sequence where all exons are merged together and you want to "separate" those exons, and then translate them to their individual amino acid sequences? I think this should be straightforward if you know the annotations: you know how many exons you have, which one comes first, second, and so on, and the length of every exon. Then just "splice" the gene sequence according to this information; a short for-loop would probably do the trick. Once you have the separated exons, "translate" every exon to its amino acid sequence, which should be simple with Biopython. Just keep track of the reading frame, since an exon does not have to start on a codon boundary.
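The for-loop idea could look roughly like the sketch below. It assumes the input is the already spliced coding sequence in transcript orientation, and that the segment lengths were computed from the GTF as end - start + 1. To stay self-contained it uses a small standard-genetic-code table written out by hand; Biopython's `Seq.translate()` could do the translation step instead. The names `translate` and `peptides_per_segment` are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 0; unknown codons become 'X'.

    A trailing incomplete codon is ignored.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def peptides_per_segment(cds_seq, segment_lengths):
    """Return the stretch of protein encoded by each CDS segment.

    The whole CDS is translated once and the protein is then sliced,
    so the reading frame is preserved across splice junctions.
    A residue whose codon spans a junction is included in both segments.
    """
    protein = translate(cds_seq)
    peptides = []
    offset = 0  # coding nucleotides that precede the current segment
    for length in segment_lengths:
        first = offset // 3                # 0-based index of first codon touched
        last = (offset + length - 1) // 3  # 0-based index of last codon touched
        peptides.append(protein[first:last + 1])
        offset += length
    return peptides


# Toy example: a 15-nt CDS split into segments of 7 and 8 nucleotides
print(peptides_per_segment("ATGGCCAAAGGTTGA", [7, 8]))
# ['MAK', 'KG*']
```

Translating the full CDS first and slicing afterwards is a deliberate choice: translating each exon on its own would shift the frame whenever an exon's length is not a multiple of 3. Since a protein FASTA is already available, the same index arithmetic could be applied directly to that protein sequence instead of re-translating.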